Commit f2f872f

edumazet authored and davem330 committed
netem: Introduce skb_orphan_partial() helper
Commit 547669d ("tcp: xps: fix reordering issues") added unexpected
reorders in case netem is used in a MQ setup for high performance test bed.

    ETH=eth0
    tc qd del dev $ETH root 2>/dev/null
    tc qd add dev $ETH root handle 1: mq
    for i in `seq 1 32`
    do
      tc qd add dev $ETH parent 1:$i netem delay 100ms
    done

As all tcp packets are orphaned by netem, TCP stack believes it can
set skb->ooo_okay on all packets.

In order to allow producers to send more packets, we want to
keep sk_wmem_alloc from reaching sk_sndbuf limit.

We can do that by accounting one byte per skb in netem queues,
so that TCP stack is not fooled too much.

Tested:

With above MQ/netem setup, scaling number of concurrent flows gives
linear results and no reorders/retransmits

    lpq83:~# for n in 1 10 20 30 40 50 60 70 80 90 100
     do echo -n "n:$n " ; ./super_netperf $n -H 10.7.7.84; done
    n:1 198.46
    n:10 2002.69
    n:20 4000.98
    n:30 6006.35
    n:40 8020.93
    n:50 10032.3
    n:60 12081.9
    n:70 13971.3
    n:80 16009.7
    n:90 17117.3
    n:100 17425.5

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
1 parent ca4c3fc commit f2f872f

File tree: 3 files changed, +21 −4 lines


include/net/sock.h

Lines changed: 1 addition & 0 deletions
@@ -1520,6 +1520,7 @@ extern struct sk_buff *sock_rmalloc(struct sock *sk,
 				   unsigned long size, int force,
 				   gfp_t priority);
 extern void sock_wfree(struct sk_buff *skb);
+extern void skb_orphan_partial(struct sk_buff *skb);
 extern void sock_rfree(struct sk_buff *skb);
 extern void sock_edemux(struct sk_buff *skb);

net/core/sock.c

Lines changed: 19 additions & 0 deletions
@@ -1576,6 +1576,25 @@ void sock_wfree(struct sk_buff *skb)
 }
 EXPORT_SYMBOL(sock_wfree);

+void skb_orphan_partial(struct sk_buff *skb)
+{
+	/* TCP stack sets skb->ooo_okay based on sk_wmem_alloc,
+	 * so we do not completely orphan skb, but transfer all
+	 * accounted bytes but one, to avoid unexpected reorders.
+	 */
+	if (skb->destructor == sock_wfree
+#ifdef CONFIG_INET
+	    || skb->destructor == tcp_wfree
+#endif
+	    ) {
+		atomic_sub(skb->truesize - 1, &skb->sk->sk_wmem_alloc);
+		skb->truesize = 1;
+	} else {
+		skb_orphan(skb);
+	}
+}
+EXPORT_SYMBOL(skb_orphan_partial);
+
 /*
  * Read buffer destructor automatically called from kfree_skb.
  */

net/sched/sch_netem.c

Lines changed: 1 addition & 4 deletions
@@ -412,12 +412,9 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch)

 	/* If a delay is expected, orphan the skb. (orphaning usually takes
 	 * place at TX completion time, so _before_ the link transit delay)
-	 * Ideally, this orphaning should be done after the rate limiting
-	 * module, because this breaks TCP Small Queue, and other mechanisms
-	 * based on socket sk_wmem_alloc.
 	 */
 	if (q->latency || q->jitter)
-		skb_orphan(skb);
+		skb_orphan_partial(skb);

 	/*
 	 * If we need to duplicate packet, then re-insert at top of the
