
Commit fc8b81a

Merge branch 'lockless-qdisc-series'
John Fastabend says:

====================
lockless qdisc series

This series adds support for building lockless qdiscs. It is the result of noticing that the qdisc lock is a common hot-spot in perf analysis of the Linux network stack, especially when testing at high packet-per-second rates. However, nothing is free, and most qdiscs rely on the qdisc lock for their data structures, so each qdisc must be converted on a case-by-case basis. In this series, to kick things off, we make pfifo_fast, mq, and mqprio lockless. Follow-up series can address additional qdiscs as needed; for example, sch_tbf might be useful. To allow this, the lockless design is an opt-in flag. In some future utopia we convert all qdiscs and get to drop this case analysis, but in order to make progress we live in the real world.

There are also a handful of optimizations I have behind this series, and a few code cleanups that I couldn't figure out how to fit neatly into this series without increasing the patch count. Once this is in, additional patches can address them. The most notable: in skb dequeue we can push the consumer lock out a bit and consume multiple skbs off the skb_array in pfifo_fast per iteration. Ideally we could push arrays of packets at drivers as well, but we would need the infrastructure for this. The other notable improvement is to do less locking in the overrun cases where the bad-txq list and gso_skb are being hit. Although nice in theory, in practice this is the error case and I haven't found a benchmark where it matters yet.

For testing, my first test case uses multiple containers (via cilium), where multiple client containers use 'wrk' to benchmark connections against a server container running lighttpd, configured to use multiple threads, one per core. Additionally, this test has a proxy agent running, so all traffic takes an extra hop through a proxy container. In these cases each TCP packet traverses the egress qdisc layer at least four times and the ingress qdisc layer an additional four times. This makes for a good stress test IMO; perf details below.

The other micro-benchmark I run injects packets directly into the qdisc layer using pktgen, via the benchmark script ./pktgen_bench_xmit_mode_queue_xmit.sh.

Benchmarks were taken in two cases: "base" runs the latest net-next with no changes to the qdisc layer, and "qdisc" runs with the lockless qdisc updates. Numbers are reported in req/sec. All virtual 'veth' devices run with pfifo_fast in the qdisc test case.

  `wrk -t16 -c $conns -d30 "http://[$SERVER_IP4]:80"`

  conns    16      32      64     1024
  -------------------------------------
  base:   18831   20201   21393   29151
  qdisc:  19309   21063   23899   29265

Notice that in all cases we see a performance improvement in the qdisc case. The pktgen microbenchmarks are as follows:

  `pktgen_bench_xmit_mode_queue_xmit.sh -t 1 -i eth2 -c 20000000`

  base(mq):           2.1 Mpps
  base(pfifo_fast):   2.1 Mpps
  qdisc(mq):          2.6 Mpps
  qdisc(pfifo_fast):  2.6 Mpps

The numbers are the same for mq and pfifo_fast because only a single thread is tested here. In both tests we see a nice bump in performance. The key with 'mq' is that it is already a per-txq ring, so contention is minimal in the above cases. Qdiscs such as tbf or htb, which have more contention, will likely show larger gains when/if lockless versions are implemented.

Thanks to everyone who helped with this work, especially Daniel Borkmann, Eric Dumazet and Willem de Bruijn, for discussing the design and reviewing versions of the code.

Changes from the RFC: dropped a couple of patches off the end; fixed a bug where skb_queue_walk_safe did not unlink the skb in all cases; fixed a lockdep splat where pfifo_fast_destroy was not calling the *_bh lock variant; addressed most of Willem's comments; and fixed a bug in the bulk locking (final patch) of the RFC series. @willem, I left the lockdep annotation out for a follow-on series that adds lockdep more completely, rather than only in the code I touched.

Comments and feedback welcome.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
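Editor's note on the opt-in mechanism the cover letter describes: a qdisc declares itself lockless through the new Qdisc_ops.static_flags field (added in the sch_generic.h diff below), which the core propagates into the qdisc's flags at allocation time. The sketch below is illustrative only; TCQ_F_NOLOCK and static_flags come from this series, while the TCQ_F_CPUSTATS pairing and the exact pfifo_fast initializer fields are assumptions based on the per-cpu stats helpers added alongside.

/* Illustrative sketch, not quoted from the patches: how a qdisc such
 * as pfifo_fast might opt in to the lockless path. The handler
 * functions named here are the existing pfifo_fast callbacks.
 */
static struct Qdisc_ops pfifo_fast_ops __read_mostly = {
        .id           = "pfifo_fast",
        .priv_size    = sizeof(struct pfifo_fast_priv),
        .enqueue      = pfifo_fast_enqueue,
        .dequeue      = pfifo_fast_dequeue,
        .peek         = pfifo_fast_peek,
        .init         = pfifo_fast_init,
        .reset        = pfifo_fast_reset,
        .dump         = pfifo_fast_dump,
        .owner        = THIS_MODULE,
        /* opt in: the core can copy static_flags into sch->flags */
        .static_flags = TCQ_F_NOLOCK | TCQ_F_CPUSTATS,
};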
2 parents: fdaa767 + c5ad119

10 files changed: 512 additions, 176 deletions

include/linux/skb_array.h

Lines changed: 5 additions & 0 deletions
@@ -72,6 +72,11 @@ static inline bool __skb_array_empty(struct skb_array *a)
         return !__ptr_ring_peek(&a->ring);
 }
 
+static inline struct sk_buff *__skb_array_peek(struct skb_array *a)
+{
+        return __ptr_ring_peek(&a->ring);
+}
+
 static inline bool skb_array_empty(struct skb_array *a)
 {
         return ptr_ring_empty(&a->ring);
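Why add a peek? A consumer often needs to inspect the head skb before committing to remove it from the ring, for example to decide whether it can be transmitted now. A minimal usage sketch follows; can_xmit_now() is a hypothetical check, and per the usual ptr_ring convention the __-prefixed calls assume the consumer side is serialized by the caller.

/* Hedged sketch: look at the head without consuming, then consume
 * only once the skb is actually usable. The consumer spinlock here
 * stands in for whatever serialization the qdisc provides.
 */
static struct sk_buff *try_take(struct skb_array *a, spinlock_t *clock)
{
        struct sk_buff *skb;

        spin_lock(clock);                       /* serialize consumers */
        skb = __skb_array_peek(a);              /* look, don't consume */
        if (skb && can_xmit_now(skb))           /* hypothetical check */
                skb = __ptr_ring_consume(&a->ring); /* commit removal */
        else
                skb = NULL;                     /* leave skb queued */
        spin_unlock(clock);
        return skb;
}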

include/net/gen_stats.h

Lines changed: 3 additions & 0 deletions
@@ -49,6 +49,9 @@ int gnet_stats_copy_rate_est(struct gnet_dump *d,
 int gnet_stats_copy_queue(struct gnet_dump *d,
                           struct gnet_stats_queue __percpu *cpu_q,
                           struct gnet_stats_queue *q, __u32 qlen);
+void __gnet_stats_copy_queue(struct gnet_stats_queue *qstats,
+                             const struct gnet_stats_queue __percpu *cpu_q,
+                             const struct gnet_stats_queue *q, __u32 qlen);
 int gnet_stats_copy_app(struct gnet_dump *d, void *st, int len);
 
 int gnet_stats_finish_copy(struct gnet_dump *d);

include/net/pkt_sched.h

Lines changed: 6 additions & 4 deletions
@@ -105,16 +105,18 @@ struct qdisc_rate_table *qdisc_get_rtab(struct tc_ratespec *r,
 void qdisc_put_rtab(struct qdisc_rate_table *tab);
 void qdisc_put_stab(struct qdisc_size_table *tab);
 void qdisc_warn_nonwc(const char *txt, struct Qdisc *qdisc);
-int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
-                    struct net_device *dev, struct netdev_queue *txq,
-                    spinlock_t *root_lock, bool validate);
+bool sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
+                     struct net_device *dev, struct netdev_queue *txq,
+                     spinlock_t *root_lock, bool validate);
 
 void __qdisc_run(struct Qdisc *q);
 
 static inline void qdisc_run(struct Qdisc *q)
 {
-        if (qdisc_run_begin(q))
+        if (qdisc_run_begin(q)) {
                 __qdisc_run(q);
+                qdisc_run_end(q);
+        }
 }
 
 static inline __be16 tc_skb_protocol(const struct sk_buff *skb)
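Two calling-convention changes land here: sch_direct_xmit() now returns bool (true means the caller may keep draining the qdisc, false means stop because the skb was requeued or the driver ring is busy), and __qdisc_run() no longer ends the run itself, so every direct caller must pair it with qdisc_run_begin()/qdisc_run_end() the way qdisc_run() now does. A hedged sketch of a caller loop under the new bool convention, simplified from qdisc_restart-style logic and not the verbatim kernel function:

/* Sketch only: drain until the qdisc is empty or sch_direct_xmit()
 * says to stop. Everything except sch_direct_xmit() itself is
 * illustrative; the real loop also handles gso_skb requeues.
 */
static void drain_qdisc(struct Qdisc *q, struct net_device *dev,
                        struct netdev_queue *txq, spinlock_t *root_lock)
{
        struct sk_buff *skb;

        while ((skb = q->dequeue(q)) != NULL) {
                /* false => requeued or driver busy; give up for now */
                if (!sch_direct_xmit(skb, q, dev, txq, root_lock, true))
                        break;
        }
}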

include/net/sch_generic.h

Lines changed: 70 additions & 9 deletions
@@ -71,6 +71,7 @@ struct Qdisc {
  *                    qdisc_tree_decrease_qlen() should stop.
  */
 #define TCQ_F_INVISIBLE 0x80  /* invisible by default in dump */
+#define TCQ_F_NOLOCK    0x100 /* qdisc does not require locking */
         u32 limit;
         const struct Qdisc_ops *ops;
         struct qdisc_size_table __rcu *stab;
@@ -87,14 +88,14 @@ struct Qdisc {
         /*
          * For performance sake on SMP, we put highly modified fields at the end
          */
-        struct sk_buff *gso_skb ____cacheline_aligned_in_smp;
+        struct sk_buff_head gso_skb ____cacheline_aligned_in_smp;
         struct qdisc_skb_head q;
         struct gnet_stats_basic_packed bstats;
         seqcount_t running;
         struct gnet_stats_queue qstats;
         unsigned long state;
         struct Qdisc *next_sched;
-        struct sk_buff *skb_bad_txq;
+        struct sk_buff_head skb_bad_txq;
         int padded;
         refcount_t refcnt;
@@ -179,6 +180,7 @@ struct Qdisc_ops {
         const struct Qdisc_class_ops *cl_ops;
         char id[IFNAMSIZ];
         int priv_size;
+        unsigned int static_flags;
 
         int (*enqueue)(struct sk_buff *skb,
                        struct Qdisc *sch,
@@ -290,11 +292,31 @@ static inline void qdisc_cb_private_validate(const struct sk_buff *skb, int sz)
         BUILD_BUG_ON(sizeof(qcb->data) < sz);
 }
 
+static inline int qdisc_qlen_cpu(const struct Qdisc *q)
+{
+        return this_cpu_ptr(q->cpu_qstats)->qlen;
+}
+
 static inline int qdisc_qlen(const struct Qdisc *q)
 {
         return q->q.qlen;
 }
 
+static inline int qdisc_qlen_sum(const struct Qdisc *q)
+{
+        __u32 qlen = 0;
+        int i;
+
+        if (q->flags & TCQ_F_NOLOCK) {
+                for_each_possible_cpu(i)
+                        qlen += per_cpu_ptr(q->cpu_qstats, i)->qlen;
+        } else {
+                qlen = q->q.qlen;
+        }
+
+        return qlen;
+}
+
 static inline struct qdisc_skb_cb *qdisc_skb_cb(const struct sk_buff *skb)
 {
         return (struct qdisc_skb_cb *)skb->cb;
@@ -631,12 +653,39 @@ static inline void qdisc_qstats_backlog_dec(struct Qdisc *sch,
         sch->qstats.backlog -= qdisc_pkt_len(skb);
 }
 
+static inline void qdisc_qstats_cpu_backlog_dec(struct Qdisc *sch,
+                                                const struct sk_buff *skb)
+{
+        this_cpu_sub(sch->cpu_qstats->backlog, qdisc_pkt_len(skb));
+}
+
 static inline void qdisc_qstats_backlog_inc(struct Qdisc *sch,
                                             const struct sk_buff *skb)
 {
         sch->qstats.backlog += qdisc_pkt_len(skb);
 }
 
+static inline void qdisc_qstats_cpu_backlog_inc(struct Qdisc *sch,
+                                                const struct sk_buff *skb)
+{
+        this_cpu_add(sch->cpu_qstats->backlog, qdisc_pkt_len(skb));
+}
+
+static inline void qdisc_qstats_cpu_qlen_inc(struct Qdisc *sch)
+{
+        this_cpu_inc(sch->cpu_qstats->qlen);
+}
+
+static inline void qdisc_qstats_cpu_qlen_dec(struct Qdisc *sch)
+{
+        this_cpu_dec(sch->cpu_qstats->qlen);
+}
+
+static inline void qdisc_qstats_cpu_requeues_inc(struct Qdisc *sch)
+{
+        this_cpu_inc(sch->cpu_qstats->requeues);
+}
+
 static inline void __qdisc_qstats_drop(struct Qdisc *sch, int count)
 {
         sch->qstats.drops += count;
@@ -767,26 +816,30 @@ static inline struct sk_buff *qdisc_peek_head(struct Qdisc *sch)
 /* generic pseudo peek method for non-work-conserving qdisc */
 static inline struct sk_buff *qdisc_peek_dequeued(struct Qdisc *sch)
 {
+        struct sk_buff *skb = skb_peek(&sch->gso_skb);
+
         /* we can reuse ->gso_skb because peek isn't called for root qdiscs */
-        if (!sch->gso_skb) {
-                sch->gso_skb = sch->dequeue(sch);
-                if (sch->gso_skb) {
+        if (!skb) {
+                skb = sch->dequeue(sch);
+
+                if (skb) {
+                        __skb_queue_head(&sch->gso_skb, skb);
                         /* it's still part of the queue */
-                        qdisc_qstats_backlog_inc(sch, sch->gso_skb);
+                        qdisc_qstats_backlog_inc(sch, skb);
                         sch->q.qlen++;
                 }
         }
 
-        return sch->gso_skb;
+        return skb;
 }
 
 /* use instead of qdisc->dequeue() for all qdiscs queried with ->peek() */
 static inline struct sk_buff *qdisc_dequeue_peeked(struct Qdisc *sch)
 {
-        struct sk_buff *skb = sch->gso_skb;
+        struct sk_buff *skb = skb_peek(&sch->gso_skb);
 
         if (skb) {
-                sch->gso_skb = NULL;
+                skb = __skb_dequeue(&sch->gso_skb);
                 qdisc_qstats_backlog_dec(sch, skb);
                 sch->q.qlen--;
         } else {
@@ -844,6 +897,14 @@ static inline void rtnl_qdisc_drop(struct sk_buff *skb, struct Qdisc *sch)
         qdisc_qstats_drop(sch);
 }
 
+static inline int qdisc_drop_cpu(struct sk_buff *skb, struct Qdisc *sch,
+                                 struct sk_buff **to_free)
+{
+        __qdisc_drop(skb, to_free);
+        qdisc_qstats_cpu_drop(sch);
+
+        return NET_XMIT_DROP;
+}
 
 static inline int qdisc_drop(struct sk_buff *skb, struct Qdisc *sch,
                              struct sk_buff **to_free)
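The per-cpu helpers above exist so that a TCQ_F_NOLOCK qdisc never writes the shared qstats words from multiple CPUs at once. A minimal enqueue sketch using them follows; my_band_ring() is a hypothetical stand-in for the qdisc's own lock-free queue (pfifo_fast uses skb_array elsewhere in this series), so this is an illustration of the helper API, not the actual pfifo_fast conversion.

/* Illustrative lockless enqueue: stats go through the per-cpu
 * helpers, drops through qdisc_drop_cpu().
 */
static int nolock_enqueue(struct sk_buff *skb, struct Qdisc *qdisc,
                          struct sk_buff **to_free)
{
        struct skb_array *ring = my_band_ring(qdisc, skb); /* hypothetical */

        if (skb_array_produce(ring, skb))       /* non-zero: ring full */
                return qdisc_drop_cpu(skb, qdisc, to_free);

        qdisc_qstats_cpu_qlen_inc(qdisc);
        qdisc_qstats_cpu_backlog_inc(qdisc, skb);
        return NET_XMIT_SUCCESS;
}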

net/core/dev.c

Lines changed: 25 additions & 6 deletions
@@ -3162,6 +3162,21 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
         int rc;
 
         qdisc_calculate_pkt_len(skb, q);
+
+        if (q->flags & TCQ_F_NOLOCK) {
+                if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
+                        __qdisc_drop(skb, &to_free);
+                        rc = NET_XMIT_DROP;
+                } else {
+                        rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
+                        __qdisc_run(q);
+                }
+
+                if (unlikely(to_free))
+                        kfree_skb_list(to_free);
+                return rc;
+        }
+
         /*
          * Heuristic to force contended enqueues to serialize on a
          * separate lock before trying to get qdisc main lock.
@@ -3192,9 +3207,9 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
                                 contended = false;
                         }
                         __qdisc_run(q);
-                } else
-                        qdisc_run_end(q);
+                }
 
+                qdisc_run_end(q);
                 rc = NET_XMIT_SUCCESS;
         } else {
                 rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
@@ -3204,6 +3219,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
                                 contended = false;
                         }
                         __qdisc_run(q);
+                        qdisc_run_end(q);
                 }
         }
         spin_unlock(root_lock);
@@ -4143,19 +4159,22 @@ static __latent_entropy void net_tx_action(struct softirq_action *h)
 
                 while (head) {
                         struct Qdisc *q = head;
-                        spinlock_t *root_lock;
+                        spinlock_t *root_lock = NULL;
 
                         head = head->next_sched;
 
-                        root_lock = qdisc_lock(q);
-                        spin_lock(root_lock);
+                        if (!(q->flags & TCQ_F_NOLOCK)) {
+                                root_lock = qdisc_lock(q);
+                                spin_lock(root_lock);
+                        }
                         /* We need to make sure head->next_sched is read
                          * before clearing __QDISC_STATE_SCHED
                          */
                         smp_mb__before_atomic();
                         clear_bit(__QDISC_STATE_SCHED, &q->state);
                         qdisc_run(q);
-                        spin_unlock(root_lock);
+                        if (root_lock)
+                                spin_unlock(root_lock);
                 }

net/core/gen_stats.c

Lines changed: 5 additions & 4 deletions
@@ -252,10 +252,10 @@ __gnet_stats_copy_queue_cpu(struct gnet_stats_queue *qstats,
         }
 }
 
-static void __gnet_stats_copy_queue(struct gnet_stats_queue *qstats,
-                                    const struct gnet_stats_queue __percpu *cpu,
-                                    const struct gnet_stats_queue *q,
-                                    __u32 qlen)
+void __gnet_stats_copy_queue(struct gnet_stats_queue *qstats,
+                             const struct gnet_stats_queue __percpu *cpu,
+                             const struct gnet_stats_queue *q,
+                             __u32 qlen)
 {
         if (cpu) {
                 __gnet_stats_copy_queue_cpu(qstats, cpu);
@@ -269,6 +269,7 @@ static void __gnet_stats_copy_queue(struct gnet_stats_queue *qstats,
 
         qstats->qlen = qlen;
 }
+EXPORT_SYMBOL(__gnet_stats_copy_queue);
 
 /**
  * gnet_stats_copy_queue - copy queue statistics into statistics TLV
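Un-static-ing and exporting __gnet_stats_copy_queue matters because classful qdiscs such as mq and mqprio need to aggregate their children's queue stats outside a gnet_dump session, folding per-cpu counters (when TCQ_F_CPUSTATS is set) or the plain qstats into a running total. A hedged sketch of such a caller; the dump code in this series does something along these lines, but this exact helper is illustrative:

/* Illustrative aggregation over one child qdisc. */
static void sum_child_qstats(struct Qdisc *child,
                             struct gnet_stats_queue *tot)
{
        struct gnet_stats_queue qstats = {};
        __u32 qlen = qdisc_qlen_sum(child);

        __gnet_stats_copy_queue(&qstats, child->cpu_qstats,
                                &child->qstats, qlen);

        tot->qlen       += qstats.qlen;
        tot->backlog    += qstats.backlog;
        tot->drops      += qstats.drops;
        tot->requeues   += qstats.requeues;
        tot->overlimits += qstats.overlimits;
}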

net/sched/sch_api.c

Lines changed: 7 additions & 1 deletion
@@ -797,7 +797,8 @@ static int tc_fill_qdisc(struct sk_buff *skb, struct Qdisc *q, u32 clid,
                 goto nla_put_failure;
         if (q->ops->dump && q->ops->dump(q, skb) < 0)
                 goto nla_put_failure;
-        qlen = q->q.qlen;
+
+        qlen = qdisc_qlen_sum(q);
 
         stab = rtnl_dereference(q->stab);
         if (stab && qdisc_dump_stab(skb, stab) < 0)
@@ -954,6 +955,11 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
         } else {
                 const struct Qdisc_class_ops *cops = parent->ops->cl_ops;
 
+                /* Only support running class lockless if parent is lockless */
+                if (new && (new->flags & TCQ_F_NOLOCK) &&
+                    parent && !(parent->flags & TCQ_F_NOLOCK))
+                        new->flags &= ~TCQ_F_NOLOCK;
+
                 err = -EOPNOTSUPP;
                 if (cops && cops->graft) {
                         unsigned long cl = cops->find(parent, classid);
