
Commit 2d29c9f

TheJena authored and axboe committed
block, bfq: improve asymmetric scenarios detection
bfq defines as asymmetric a scenario where an active entity, say E (representing either a single bfq_queue or a group of other entities), has a higher weight than some other entities. If the entity E does sync I/O in such a scenario, then bfq plugs the dispatch of the I/O of the other entities in the following situation: E is in service but temporarily has no pending I/O request. In fact, without this plugging, all the times that E stops being temporarily idle, it may find the internal queues of the storage device already filled with an out-of-control number of extra requests, from other entities. So E may have to wait for the service of these extra requests, before finally having its own requests served. This may easily break service guarantees, with E getting less than its fair share of the device throughput. Usually, the end result is that E gets the same fraction of the throughput as the other entities, instead of getting more, according to its higher weight.

Yet there are two other more subtle cases where E, even if its weight is actually equal to or even lower than the weight of any other active entities, may get less than its fair share of the throughput in case the above I/O plugging is not performed:
1. other entities issue larger requests than E;
2. other entities contain more active child entities than E (or in general tend to have more backlog than E).

In the first case, other entities may get more service than E because they get larger requests, than those of E, served during the temporary idle periods of E. In the second case, other entities get more service because, by having many child entities, they have many requests ready for dispatching while E is temporarily idle.

This commit addresses this issue by extending the definition of asymmetric scenario: a scenario is asymmetric when
- active entities representing bfq_queues have differentiated weights, as in the original definition
or (inclusive)
- one or more entities representing groups of entities are active.

This broader definition makes sure that I/O plugging will be performed in all the above cases, provided that there is at least one active group. Of course, this definition is very coarse, so it will trigger I/O plugging also in cases where it is not needed, such as, e.g., multiple active entities with just one child each, and all with the same I/O-request size. The reason for this coarse definition is just that a finer-grained definition would be rather heavy to compute.

On the opposite end, even this new definition does not trigger I/O plugging in all cases where there is no active group, and all bfq_queues have the same weight. So, in these cases some unfairness may occur if there are asymmetries in I/O-request sizes. We made this choice because I/O plugging may lower throughput, and probably a user that has not created any group cares more about throughput than about perfect fairness. At any rate, as for possible applications that may care about service guarantees, bfq already guarantees a high responsiveness and a low latency to soft real-time applications automatically.

Signed-off-by: Federico Motta <federico@willer.it>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
1 parent a2fa8a1 commit 2d29c9f
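To make the broadened check concrete, here is a minimal userspace C sketch of the two definitions, under simplified assumptions: the rb-tree of per-weight counters is reduced to a plain count of distinct active queue weights, the old group-weights branch is omitted for brevity, and all names except num_active_groups (the field this patch introduces) are hypothetical.

#include <stdbool.h>
#include <stdio.h>

/* Simplified stand-in for struct bfq_data: instead of the real rb-tree
 * of per-weight counters, track only how many distinct queue weights
 * are active, plus the active-group count added by this patch. */
struct toy_bfqd {
	int num_distinct_queue_weights;	/* nodes of queue_weights_tree */
	int num_active_groups;		/* field introduced by the patch */
};

/* Old definition (queue side): asymmetric only when weights differ. */
static bool old_asymmetric(const struct toy_bfqd *d)
{
	return d->num_distinct_queue_weights > 1;
}

/* New definition: also asymmetric whenever any group is active, which
 * covers differing request sizes and differing numbers of children
 * without tracking either explicitly. */
static bool new_asymmetric(const struct toy_bfqd *d)
{
	return d->num_distinct_queue_weights > 1 || d->num_active_groups > 0;
}

int main(void)
{
	/* Equal queue weights but one active group: only the new
	 * definition flags the scenario, so only it plugs I/O. */
	struct toy_bfqd d = { .num_distinct_queue_weights = 1,
			      .num_active_groups = 1 };

	printf("old: %d, new: %d\n", old_asymmetric(&d), new_asymmetric(&d));
	return 0;
}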

File tree

3 files changed: +155 -131 lines changed


block/bfq-iosched.c

Lines changed: 124 additions & 99 deletions
@@ -624,22 +624,21 @@ void bfq_pos_tree_add_move(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 }
 
 /*
- * Tell whether there are active queues or groups with differentiated weights.
+ * Tell whether there are active queues with different weights or
+ * active groups.
  */
-static bool bfq_differentiated_weights(struct bfq_data *bfqd)
+static bool bfq_varied_queue_weights_or_active_groups(struct bfq_data *bfqd)
 {
 	/*
-	 * For weights to differ, at least one of the trees must contain
+	 * For queue weights to differ, queue_weights_tree must contain
 	 * at least two nodes.
 	 */
 	return (!RB_EMPTY_ROOT(&bfqd->queue_weights_tree) &&
 		(bfqd->queue_weights_tree.rb_node->rb_left ||
 		 bfqd->queue_weights_tree.rb_node->rb_right)
 #ifdef CONFIG_BFQ_GROUP_IOSCHED
 	       ) ||
-	       (!RB_EMPTY_ROOT(&bfqd->group_weights_tree) &&
-		(bfqd->group_weights_tree.rb_node->rb_left ||
-		 bfqd->group_weights_tree.rb_node->rb_right)
+	       (bfqd->num_active_groups > 0
 #endif
 	       );
 }
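For reference, queue_weights_tree keeps one node per distinct weight among the active queues, so "the root has at least one child" is exactly "at least two distinct weights". A standalone restatement of that test, with a toy mirror of the kernel's rb_root layout (names otherwise hypothetical):

#include <stdbool.h>

/* Minimal mirror of the kernel's rb-tree layout, just enough to
 * express the "at least two nodes" test used in the hunk above. */
struct toy_rb_node {
	struct toy_rb_node *rb_left;
	struct toy_rb_node *rb_right;
};

struct toy_rb_root {
	struct toy_rb_node *rb_node;
};

/* A non-empty tree has two or more nodes exactly when the root has at
 * least one child; since nodes are keyed by weight, that means at
 * least two distinct active queue weights. */
static bool toy_weights_vary(const struct toy_rb_root *root)
{
	return root->rb_node &&
	       (root->rb_node->rb_left || root->rb_node->rb_right);
}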
@@ -657,26 +656,25 @@ static bool bfq_differentiated_weights(struct bfq_data *bfqd)
  * 3) all active groups at the same level in the groups tree have the same
  *    number of children.
  *
- * Unfortunately, keeping the necessary state for evaluating exactly the
- * above symmetry conditions would be quite complex and time-consuming.
- * Therefore this function evaluates, instead, the following stronger
- * sub-conditions, for which it is much easier to maintain the needed
- * state:
+ * Unfortunately, keeping the necessary state for evaluating exactly
+ * the last two symmetry sub-conditions above would be quite complex
+ * and time consuming. Therefore this function evaluates, instead,
+ * only the following stronger two sub-conditions, for which it is
+ * much easier to maintain the needed state:
  * 1) all active queues have the same weight,
- * 2) all active groups have the same weight,
- * 3) all active groups have at most one active child each.
- * In particular, the last two conditions are always true if hierarchical
- * support and the cgroups interface are not enabled, thus no state needs
- * to be maintained in this case.
+ * 2) there are no active groups.
+ * In particular, the last condition is always true if hierarchical
+ * support or the cgroups interface are not enabled, thus no state
+ * needs to be maintained in this case.
  */
 static bool bfq_symmetric_scenario(struct bfq_data *bfqd)
 {
-	return !bfq_differentiated_weights(bfqd);
+	return !bfq_varied_queue_weights_or_active_groups(bfqd);
 }
 
 /*
  * If the weight-counter tree passed as input contains no counter for
- * the weight of the input entity, then add that counter; otherwise just
+ * the weight of the input queue, then add that counter; otherwise just
  * increment the existing counter.
  *
  * Note that weight-counter trees contain few nodes in mostly symmetric
@@ -687,25 +685,25 @@ static bool bfq_symmetric_scenario(struct bfq_data *bfqd)
  * In most scenarios, the rate at which nodes are created/destroyed
  * should be low too.
  */
-void bfq_weights_tree_add(struct bfq_data *bfqd, struct bfq_entity *entity,
+void bfq_weights_tree_add(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 			  struct rb_root *root)
 {
+	struct bfq_entity *entity = &bfqq->entity;
 	struct rb_node **new = &(root->rb_node), *parent = NULL;
 
 	/*
-	 * Do not insert if the entity is already associated with a
+	 * Do not insert if the queue is already associated with a
 	 * counter, which happens if:
-	 * 1) the entity is associated with a queue,
-	 * 2) a request arrival has caused the queue to become both
+	 * 1) a request arrival has caused the queue to become both
 	 *    non-weight-raised, and hence change its weight, and
 	 *    backlogged; in this respect, each of the two events
 	 *    causes an invocation of this function,
-	 * 3) this is the invocation of this function caused by the
+	 * 2) this is the invocation of this function caused by the
 	 *    second event. This second invocation is actually useless,
 	 *    and we handle this fact by exiting immediately. More
 	 *    efficient or clearer solutions might possibly be adopted.
 	 */
-	if (entity->weight_counter)
+	if (bfqq->weight_counter)
 		return;
 
 	while (*new) {
@@ -715,7 +713,7 @@ void bfq_weights_tree_add(struct bfq_data *bfqd, struct bfq_entity *entity,
 		parent = *new;
 
 		if (entity->weight == __counter->weight) {
-			entity->weight_counter = __counter;
+			bfqq->weight_counter = __counter;
 			goto inc_counter;
 		}
 		if (entity->weight < __counter->weight)
@@ -724,66 +722,67 @@ void bfq_weights_tree_add(struct bfq_data *bfqd, struct bfq_entity *entity,
 			new = &((*new)->rb_right);
 	}
 
-	entity->weight_counter = kzalloc(sizeof(struct bfq_weight_counter),
-					 GFP_ATOMIC);
+	bfqq->weight_counter = kzalloc(sizeof(struct bfq_weight_counter),
+				       GFP_ATOMIC);
 
 	/*
 	 * In the unlucky event of an allocation failure, we just
-	 * exit. This will cause the weight of entity to not be
-	 * considered in bfq_differentiated_weights, which, in its
-	 * turn, causes the scenario to be deemed wrongly symmetric in
-	 * case entity's weight would have been the only weight making
-	 * the scenario asymmetric. On the bright side, no unbalance
-	 * will however occur when entity becomes inactive again (the
-	 * invocation of this function is triggered by an activation
-	 * of entity). In fact, bfq_weights_tree_remove does nothing
-	 * if !entity->weight_counter.
+	 * exit. This will cause the weight of queue to not be
+	 * considered in bfq_varied_queue_weights_or_active_groups,
+	 * which, in its turn, causes the scenario to be deemed
+	 * wrongly symmetric in case bfqq's weight would have been
+	 * the only weight making the scenario asymmetric. On the
+	 * bright side, no unbalance will however occur when bfqq
+	 * becomes inactive again (the invocation of this function
+	 * is triggered by an activation of queue). In fact,
+	 * bfq_weights_tree_remove does nothing if
+	 * !bfqq->weight_counter.
 	 */
-	if (unlikely(!entity->weight_counter))
+	if (unlikely(!bfqq->weight_counter))
 		return;
 
-	entity->weight_counter->weight = entity->weight;
-	rb_link_node(&entity->weight_counter->weights_node, parent, new);
-	rb_insert_color(&entity->weight_counter->weights_node, root);
+	bfqq->weight_counter->weight = entity->weight;
+	rb_link_node(&bfqq->weight_counter->weights_node, parent, new);
+	rb_insert_color(&bfqq->weight_counter->weights_node, root);
 
 inc_counter:
-	entity->weight_counter->num_active++;
+	bfqq->weight_counter->num_active++;
 }
 
 /*
- * Decrement the weight counter associated with the entity, and, if the
+ * Decrement the weight counter associated with the queue, and, if the
  * counter reaches 0, remove the counter from the tree.
  * See the comments to the function bfq_weights_tree_add() for considerations
  * about overhead.
  */
 void __bfq_weights_tree_remove(struct bfq_data *bfqd,
-			       struct bfq_entity *entity,
+			       struct bfq_queue *bfqq,
 			       struct rb_root *root)
 {
-	if (!entity->weight_counter)
+	if (!bfqq->weight_counter)
 		return;
 
-	entity->weight_counter->num_active--;
-	if (entity->weight_counter->num_active > 0)
+	bfqq->weight_counter->num_active--;
+	if (bfqq->weight_counter->num_active > 0)
 		goto reset_entity_pointer;
 
-	rb_erase(&entity->weight_counter->weights_node, root);
-	kfree(entity->weight_counter);
+	rb_erase(&bfqq->weight_counter->weights_node, root);
+	kfree(bfqq->weight_counter);
 
 reset_entity_pointer:
-	entity->weight_counter = NULL;
+	bfqq->weight_counter = NULL;
 }
 
 /*
- * Invoke __bfq_weights_tree_remove on bfqq and all its inactive
- * parent entities.
+ * Invoke __bfq_weights_tree_remove on bfqq and decrement the number
+ * of active groups for each queue's inactive parent entity.
  */
 void bfq_weights_tree_remove(struct bfq_data *bfqd,
 			     struct bfq_queue *bfqq)
 {
 	struct bfq_entity *entity = bfqq->entity.parent;
 
-	__bfq_weights_tree_remove(bfqd, &bfqq->entity,
+	__bfq_weights_tree_remove(bfqd, bfqq,
 				  &bfqd->queue_weights_tree);
 
 	for_each_entity(entity) {
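The weight counters above act as per-weight reference counts: one counter per distinct weight, refcounted by the number of active queues carrying it. A rough userspace model of the add/remove protocol, using a linked list where the kernel uses an rb-tree (all names hypothetical):

#include <stdlib.h>

struct toy_weight_counter {
	unsigned int weight;
	unsigned int num_active;
	struct toy_weight_counter *next;
};

static struct toy_weight_counter *counters;

/* Mirror of bfq_weights_tree_add(): find or create the counter for
 * this weight, and count one more active queue on it. */
static struct toy_weight_counter *toy_counter_add(unsigned int weight)
{
	struct toy_weight_counter *c;

	for (c = counters; c; c = c->next)
		if (c->weight == weight) {
			c->num_active++;	/* existing counter: bump */
			return c;
		}
	c = calloc(1, sizeof(*c));	/* first active queue, this weight */
	if (!c)
		return NULL;		/* like bfq, tolerate the failure */
	c->weight = weight;
	c->num_active = 1;
	c->next = counters;
	counters = c;
	return c;
}

/* Mirror of __bfq_weights_tree_remove(): drop one reference and free
 * the counter when its last active queue goes away. */
static void toy_counter_remove(struct toy_weight_counter *c)
{
	struct toy_weight_counter **p;

	if (!c)
		return;
	if (--c->num_active > 0)	/* other queues share this weight */
		return;
	for (p = &counters; *p; p = &(*p)->next)
		if (*p == c) {		/* last user gone: drop the node */
			*p = c->next;
			free(c);
			return;
		}
}

Under this invariant, two or more counters exist exactly when two distinct weights are active, which is what bfq_varied_queue_weights_or_active_groups() tests.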
@@ -797,17 +796,13 @@ void bfq_weights_tree_remove(struct bfq_data *bfqd,
 			 * next_in_service for details on why
 			 * in_service_entity must be checked too).
 			 *
-			 * As a consequence, the weight of entity is
-			 * not to be removed. In addition, if entity
-			 * is active, then its parent entities are
-			 * active as well, and thus their weights are
-			 * not to be removed either. In the end, this
-			 * loop must stop here.
+			 * As a consequence, its parent entities are
+			 * active as well, and thus this loop must
+			 * stop here.
 			 */
 			break;
 		}
-		__bfq_weights_tree_remove(bfqd, entity,
-					  &bfqd->group_weights_tree);
+		bfqd->num_active_groups--;
 	}
 }
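A sketch of the hierarchy walk that replaces the old per-group weight-tree removal: starting from the queue's parent group, every ancestor that has just become fully inactive gives up one unit of num_active_groups, and the walk stops at the first ancestor that is still active (toy types; in the kernel, for_each_entity() iterates over the parent pointers):

#include <stdbool.h>

/* Toy model of bfq_weights_tree_remove(): the queue's own counter is
 * dropped first (not shown), then the walk below updates the group
 * bookkeeping that replaces the old group_weights_tree. */
struct toy_entity {
	struct toy_entity *parent;
	bool still_active;	/* stand-in for the next_in_service /
				 * in_service_entity checks */
};

static int num_active_groups;

static void toy_weights_tree_remove(struct toy_entity *queue_entity)
{
	struct toy_entity *e;

	for (e = queue_entity->parent; e; e = e->parent) {
		if (e->still_active)
			/* an active group has active ancestors too,
			 * so nothing above needs updating */
			break;
		num_active_groups--;	/* this group just went inactive */
	}
}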

@@ -3506,9 +3501,11 @@ static bool bfq_better_to_idle(struct bfq_queue *bfqq)
 	 * symmetric scenario where:
 	 * (i) each of these processes must get the same throughput as
 	 *     the others;
-	 * (ii) all these processes have the same I/O pattern
-		(either sequential or random).
-	 * In fact, in such a scenario, the drive will tend to treat
+	 * (ii) the I/O of each process has the same properties, in
+	 *      terms of locality (sequential or random), direction
+	 *      (reads or writes), request sizes, greediness
+	 *      (from I/O-bound to sporadic), and so on.
+	 * In fact, in such a scenario, the drive tends to treat
 	 * the requests of each of these processes in about the same
 	 * way as the requests of the others, and thus to provide
 	 * each of these processes with about the same throughput
@@ -3517,18 +3514,50 @@ static bool bfq_better_to_idle(struct bfq_queue *bfqq)
 	 * certainly needed to guarantee that bfqq receives its
 	 * assigned fraction of the device throughput (see [1] for
 	 * details).
+	 * The problem is that idling may significantly reduce
+	 * throughput with certain combinations of types of I/O and
+	 * devices. An important example is sync random I/O, on flash
+	 * storage with command queueing. So, unless bfqq falls in the
+	 * above cases where idling also boosts throughput, it would
+	 * be important to check conditions (i) and (ii) accurately,
+	 * so as to avoid idling when not strictly needed for service
+	 * guarantees.
+	 *
+	 * Unfortunately, it is extremely difficult to thoroughly
+	 * check condition (ii). And, in case there are active groups,
+	 * it becomes very difficult to check condition (i) too. In
+	 * fact, if there are active groups, then, for condition (i)
+	 * to become false, it is enough that an active group contains
+	 * more active processes or sub-groups than some other active
+	 * group. We address this issue with the following bi-modal
+	 * behavior, implemented in the function
+	 * bfq_symmetric_scenario().
 	 *
-	 * We address this issue by controlling, actually, only the
-	 * symmetry sub-condition (i), i.e., provided that
-	 * sub-condition (i) holds, idling is not performed,
-	 * regardless of whether sub-condition (ii) holds. In other
-	 * words, only if sub-condition (i) holds, then idling is
+	 * If there are active groups, then the scenario is tagged as
+	 * asymmetric, conservatively, without checking any of the
+	 * conditions (i) and (ii). So the device is idled for bfqq.
+	 * This behavior matches also the fact that groups are created
+	 * exactly if controlling I/O (to preserve bandwidth and
+	 * latency guarantees) is a primary concern.
+	 *
+	 * On the opposite end, if there are no active groups, then
+	 * only condition (i) is actually controlled, i.e., provided
+	 * that condition (i) holds, idling is not performed,
+	 * regardless of whether condition (ii) holds. In other words,
+	 * only if condition (i) does not hold, then idling is
 	 * allowed, and the device tends to be prevented from queueing
-	 * many requests, possibly of several processes. The reason
-	 * for not controlling also sub-condition (ii) is that we
-	 * exploit preemption to preserve guarantees in case of
-	 * symmetric scenarios, even if (ii) does not hold, as
-	 * explained in the next two paragraphs.
+	 * many requests, possibly of several processes. Since there
+	 * are no active groups, then, to control condition (i) it is
+	 * enough to check whether all active queues have the same
+	 * weight.
+	 *
+	 * Not checking condition (ii) evidently exposes bfqq to the
+	 * risk of getting less throughput than its fair share.
+	 * However, for queues with the same weight, a further
+	 * mechanism, preemption, mitigates or even eliminates this
+	 * problem. And it does so without consequences on overall
+	 * throughput. This mechanism and its benefits are explained
+	 * in the next three paragraphs.
 	 *
 	 * Even if a queue, say Q, is expired when it remains idle, Q
 	 * can still preempt the new in-service queue if the next
@@ -3542,11 +3571,7 @@ static bool bfq_better_to_idle(struct bfq_queue *bfqq)
 	 * idling allows the internal queues of the device to contain
 	 * many requests, and thus to reorder requests, we can rather
 	 * safely assume that the internal scheduler still preserves a
-	 * minimum of mid-term fairness. The motivation for using
-	 * preemption instead of idling is that, by not idling,
-	 * service guarantees are preserved without minimally
-	 * sacrificing throughput. In other words, both a high
-	 * throughput and its desired distribution are obtained.
+	 * minimum of mid-term fairness.
 	 *
 	 * More precisely, this preemption-based, idleless approach
 	 * provides fairness in terms of IOPS, and not sectors per
@@ -3565,27 +3590,27 @@ static bool bfq_better_to_idle(struct bfq_queue *bfqq)
 	 * 1024/8 times as high as the service received by the other
 	 * queue.
 	 *
-	 * On the other hand, device idling is performed, and thus
-	 * pure sector-domain guarantees are provided, for the
-	 * following queues, which are likely to need stronger
-	 * throughput guarantees: weight-raised queues, and queues
-	 * with a higher weight than other queues. When such queues
-	 * are active, sub-condition (i) is false, which triggers
-	 * device idling.
+	 * The motivation for using preemption instead of idling (for
+	 * queues with the same weight) is that, by not idling,
+	 * service guarantees are preserved (completely or at least in
+	 * part) without minimally sacrificing throughput. And, if
+	 * there is no active group, then the primary expectation for
+	 * this device is probably a high throughput.
 	 *
-	 * According to the above considerations, the next variable is
-	 * true (only) if sub-condition (i) holds. To compute the
-	 * value of this variable, we not only use the return value of
-	 * the function bfq_symmetric_scenario(), but also check
-	 * whether bfqq is being weight-raised, because
-	 * bfq_symmetric_scenario() does not take into account also
-	 * weight-raised queues (see comments on
-	 * bfq_weights_tree_add()). In particular, if bfqq is being
-	 * weight-raised, it is important to idle only if there are
-	 * other, non-weight-raised queues that may steal throughput
-	 * to bfqq. Actually, we should be even more precise, and
-	 * differentiate between interactive weight raising and
-	 * soft real-time weight raising.
+	 * We are now left only with explaining the additional
+	 * compound condition that is checked below for deciding
+	 * whether the scenario is asymmetric. To explain this
+	 * compound condition, we need to add that the function
+	 * bfq_symmetric_scenario checks the weights of only
+	 * non-weight-raised queues, for efficiency reasons (see
+	 * comments on bfq_weights_tree_add()). Then the fact that
+	 * bfqq is weight-raised is checked explicitly here. More
+	 * precisely, the compound condition below takes into account
+	 * also the fact that, even if bfqq is being weight-raised,
+	 * the scenario is still symmetric if all active queues happen
+	 * to be weight-raised. Actually, we should be even more
+	 * precise here, and differentiate between interactive weight
+	 * raising and soft real-time weight raising.
 	 *
 	 * As a side note, it is worth considering that the above
 	 * device-idling countermeasures may however fail in the
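The hunk above documents a compound condition that the diff context does not show. Based on the fields bfq actually exposes (wr_coeff > 1 marks a weight-raised queue; wr_busy_queues and busy_queues count weight-raised and all busy queues respectively), the condition plausibly reads along these lines — treat this as an illustration, not a quote of the patch:

/* Illustrative sketch only: the scenario counts as asymmetric if queue
 * weights vary or any group is active (bfq_symmetric_scenario() fails),
 * or if bfqq is weight-raised while some other busy queue is not,
 * since the symmetry check ignores weight-raised queues. */
bool asymmetric_scenario = (bfqq->wr_coeff > 1 &&
			    bfqd->wr_busy_queues < bfqd->busy_queues) ||
			   !bfq_symmetric_scenario(bfqd);

The second clause captures the comment's point: a weight-raised bfqq leaves the scenario symmetric as long as every other busy queue is weight-raised too.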
@@ -5392,7 +5417,7 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
 	bfqd->idle_slice_timer.function = bfq_idle_slice_timer;
 
 	bfqd->queue_weights_tree = RB_ROOT;
-	bfqd->group_weights_tree = RB_ROOT;
+	bfqd->num_active_groups = 0;
 
 	INIT_LIST_HEAD(&bfqd->active_list);
 	INIT_LIST_HEAD(&bfqd->idle_list);
