
Commit 3561599

Merge branch 'socket-sendmsg-zerocopy'
Willem de Bruijn says:

====================
socket sendmsg MSG_ZEROCOPY

Introduce zerocopy socket send flag MSG_ZEROCOPY. This extends the
shared page support (SKBTX_SHARED_FRAG) from sendpage to sendmsg.
Implement the feature for TCP initially, as large writes benefit
most.

On a send call with MSG_ZEROCOPY, the kernel pins user pages and
links these directly into the skbuff frags[] array.

Each send call with MSG_ZEROCOPY that transmits data will eventually
queue a completion notification on the error queue: a per-socket u32
incremented on each such call. A request may have to revert to copy
to succeed, for instance when a device cannot support scatter-gather
IO. In that case a flag is passed along to notify that the operation
succeeded without zerocopy optimization.

The implementation extends the existing zerocopy infra for tuntap,
vhost and xen with features needed for TCP, notably reference
counting to handle cloning on retransmit and GSO.

For more details, see also the netdev 2.1 paper and presentation at
https://netdevconf.org/2.1/session.html?debruijn

Changelog:

  v3 -> v4:
    - dropped UDP, RAW and PF_PACKET for now
        Without loopback support, datagrams are usually smaller than
        the ~8KB size threshold needed to benefit from zerocopy.
    - style: a few reverse christmas tree
    - minor: SO_ZEROCOPY returns ENOTSUPP on unsupported protocols
    - minor: squashed SO_EE_CODE_ZEROCOPY_COPIED patch
    - minor: rebased on top of net-next with kmap_atomic fix

  v2 -> v3:
    - fix rebase conflict: SO_ZEROCOPY 59 -> 60

  v1 -> v2:
    - fix (kbuild-bot): do not remove uarg until patch 5
    - fix (kbuild-bot): move zerocopy_sg_from_iter doc with function
    - fix: remove unused extern in header file

  RFCv2 -> v1:
    - patch 2
        - review comment: in skb_copy_ubufs, always allocate order-0
          page, also when replacing compound source pages.
    - patch 3
        - fix: always queue completion notification on MSG_ZEROCOPY,
          also if revert to copy.
        - fix: on syscall abort, correctly revert notification state
        - minor: skip queue notification on SOCK_DEAD
        - minor: replace BUG_ON with WARN_ON in recoverable error
    - patch 4
        - new: add socket option SOCK_ZEROCOPY.
          only honor MSG_ZEROCOPY if set, ignore for legacy apps.
    - patch 5
        - fix: clear zerocopy state on skb_linearize
    - patch 6
        - fix: only coalesce if prev errqueue elem is zerocopy
        - minor: try coalescing with list tail instead of head
        - minor: merge bytelen limit patch
    - patch 7
        - new: signal when data had to be copied
    - patch 8 (tcp)
        - optimize: avoid setting PSH bit when exceeding max frags.
          that limits GRO on the client. do not goto new_segment.
        - fix: fail on MSG_ZEROCOPY | MSG_FASTOPEN
        - minor: do not wait for memory: does not work for optmem
        - minor: simplify alloc
    - patch 9 (udp)
        - new: add PF_INET6
        - fix: attach zerocopy notification even if revert to copy
        - minor: simplify alloc size arithmetic
    - patch 10 (raw hdrinc)
        - new: add PF_INET6
    - patch 11 (pf_packet)
        - minor: simplify slightly
    - patch 12
        - new msg_zerocopy regression test: use veth pair to test
          all protocols: ipv4/ipv6/packet, tcp/udp/raw, cork
          all relevant ethtool settings: rx off, sg off
          all relevant packet lengths: 0, <MAX_HEADER, max size

  RFC -> RFCv2:
    - review comment: do not loop skb with zerocopy frags onto rx:
        add skb_orphan_frags_rx to orphan even refcounted frags
        call this in __netif_receive_skb_core, deliver_skb and tun:
        same as commit 1080e51 ("net: orphan frags on receive")
    - fix: hold an explicit sk reference on each notification skb.
        previously relied on the reference (or wmem) held by the
        data skb that would trigger notification, but this breaks
        on skb_orphan.
    - fix: when aborting a send, do not inc the zerocopy counter
        this caused gaps in the notification chain
    - fix: in packet with SOCK_DGRAM, pull ll headers before calling
        zerocopy_sg_from_iter
    - fix: if sock_zerocopy_realloc does not allow coalescing,
        do not fail, just allocate a new ubuf
    - fix: in tcp, check return value of second allocation attempt
    - chg: allocate notification skbs from optmem
        to avoid affecting tcp write queue accounting (TSQ)
    - chg: limit #locked pages (ulimit) per user instead of per process
    - chg: grow notification ids from 16 to 32 bit
        - pass range [lo, hi] through 32 bit fields ee_info and ee_data
    - chg: rebased to davem-net-next on top of v4.10-rc7
    - add: limit notification coalescing
        sharing ubufs limits overhead, but delays notification until
        the last packet is released, possibly unbounded. Add a cap.
    - tests: add snd_zerocopy_lo pf_packet test
    - tests: two bugfixes (add do_flush_tcp, ++sent not only in debug)

Limitations / Known Issues:
  - TCP may build slightly smaller than max TSO packets due to
    exceeding MAX_SKB_FRAGS frags when zerocopy pages are unaligned.
  - All SKBTX_SHARED_FRAG may require additional __skb_linearize or
    skb_copy_ubufs calls in u32, skb_find_text, similar to
    skb_checksum_help.

Notification skbuffs are allocated from optmem. For sockets that
cannot effectively coalesce notifications, the optmem max may need
to be increased to avoid hitting -ENOBUFS:

  sysctl -w net.core.optmem_max=1048576

In application load, copy avoidance shows a roughly 5% systemwide
reduction in cycles when streaming large flows and a 4-8% reduction
in wall clock time on early tensorflow test workloads.

For the single-machine veth tests to succeed, loopback support has
to be temporarily enabled by making skb_orphan_frags_rx map to
skb_orphan_frags.

* Performance

The below table shows cycles reported by perf for a netperf process
sending a single 10 Gbps TCP_STREAM. The first three columns show
Mcycles spent in the netperf process context. The second three columns
show time spent systemwide (-a -C A,B) on the two cpus that run the
process and interrupt handler. Reported is the median of at least 3
runs. std is a standard netperf, zc uses zerocopy and % is the ratio.
Netperf is pinned to cpu 2, network interrupts to cpu3, rps and rfs
are disabled and the kernel is booted with idle=halt.

NETPERF=./netperf -t TCP_STREAM -H $host -T 2 -l 30 -- -m $size

perf stat -e cycles $NETPERF
perf stat -C 2,3 -a -e cycles $NETPERF

        --process cycles--      ----cpu cycles----
           std      zc   %        std      zc   %
  4K:   27,609  11,217  41     49,217  39,175  79
  16K:  21,370   3,823  18     43,540  29,213  67
  64K:  20,557   2,312  11     42,189  26,910  64
  256K: 21,110   2,134  10     43,006  27,104  63
  1M:   20,987   1,610   8     42,759  25,931  61

Perf record indicates the main source of these differences. Process
cycles only at 1M writes (perf record; perf report -n):

std:
Samples: 42K of event 'cycles', Event count (approx.): 21258597313
 79.41%     33884  netperf  [kernel.kallsyms]  [k] copy_user_generic_string
  3.27%      1396  netperf  [kernel.kallsyms]  [k] tcp_sendmsg
  1.66%       694  netperf  [kernel.kallsyms]  [k] get_page_from_freelist
  0.79%       325  netperf  [kernel.kallsyms]  [k] tcp_ack
  0.43%       188  netperf  [kernel.kallsyms]  [k] __alloc_skb

zc:
Samples: 1K of event 'cycles', Event count (approx.): 1439509124
 30.36%       584  netperf.zerocop  [kernel.kallsyms]  [k] gup_pte_range
 14.63%       284  netperf.zerocop  [kernel.kallsyms]  [k] __zerocopy_sg_from_iter
  8.03%       159  netperf.zerocop  [kernel.kallsyms]  [k] skb_zerocopy_add_frags_iter
  4.84%        96  netperf.zerocop  [kernel.kallsyms]  [k] __alloc_skb
  3.10%        60  netperf.zerocop  [kernel.kallsyms]  [k] kmem_cache_alloc_node

* Safety

The number of pages that can be pinned on behalf of a user with
MSG_ZEROCOPY is bound by the locked memory ulimit.

While the kernel holds process memory pinned, a process cannot safely
reuse those pages for other purposes. Packets looped onto the receive
stack and queued to a socket can be held indefinitely. Avoid unbounded
notification latency by restricting user pages to egress paths only.
skb_orphan_frags_rx() will create a private copy of pages even for
refcounted packets when these are looped, as did skb_orphan_frags for
the original tun zerocopy implementation.

Pages are not remapped read-only. Processes can modify packet contents
while packets are in flight in the kernel path. Bytes on which kernel
control flow depends (headers) are copied to avoid TOCTTOU attacks.
Datapath integrity does not otherwise depend on payload, with three
exceptions: checksums, optional sk_filter/tc u32/.. and device +
driver logic. The effect of wrong checksums is limited to the
misbehaving process. TC filters that access contents may have to be
excluded by adding an skb_orphan_frags_rx.

Processes can also safely avoid OOM conditions by bounding the number
of bytes passed with MSG_ZEROCOPY and by removing shared pages after
transmission from their own memory map.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2 parents 84b7187 + 07b65c5 commit 3561599
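
A minimal userspace sketch of the API this merge introduces, distilled from
the commit message above. The fallback #define guards match the uapi
additions in the diffs below; the helper itself is illustrative only, with
error handling trimmed:

#include <stdio.h>
#include <sys/socket.h>

#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60			/* include/uapi/asm-generic/socket.h */
#endif
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000		/* include/linux/socket.h */
#endif

/* Illustrative zerocopy send on a connected TCP socket. The caller must
 * not modify buf until the completion notification for this send call
 * has arrived on the socket error queue. */
static ssize_t send_zerocopy(int fd, const void *buf, size_t len)
{
	int one = 1;

	/* Opt in once per socket; without SO_ZEROCOPY the flag is
	 * ignored, so legacy applications keep copy semantics. */
	if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) < 0)
		return -1;

	/* The pages backing buf are pinned and linked into skb frags[]. */
	return send(fd, buf, len, MSG_ZEROCOPY);
}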

File tree

27 files changed: +1379 -72 lines changed

arch/alpha/include/uapi/asm/socket.h  (2 additions, 0 deletions)

@@ -109,4 +109,6 @@
 
 #define SO_PEERGROUPS		59
 
+#define SO_ZEROCOPY		60
+
 #endif	/* _UAPI_ASM_SOCKET_H */

arch/frv/include/uapi/asm/socket.h  (2 additions, 0 deletions)

@@ -102,5 +102,7 @@
 
 #define SO_PEERGROUPS		59
 
+#define SO_ZEROCOPY		60
+
 #endif /* _ASM_SOCKET_H */
 

arch/ia64/include/uapi/asm/socket.h  (2 additions, 0 deletions)

@@ -111,4 +111,6 @@
 
 #define SO_PEERGROUPS		59
 
+#define SO_ZEROCOPY		60
+
 #endif /* _ASM_IA64_SOCKET_H */

arch/m32r/include/uapi/asm/socket.h  (2 additions, 0 deletions)

@@ -102,4 +102,6 @@
 
 #define SO_PEERGROUPS		59
 
+#define SO_ZEROCOPY		60
+
 #endif /* _ASM_M32R_SOCKET_H */

arch/mips/include/uapi/asm/socket.h  (2 additions, 0 deletions)

@@ -120,4 +120,6 @@
 
 #define SO_PEERGROUPS		59
 
+#define SO_ZEROCOPY		60
+
 #endif /* _UAPI_ASM_SOCKET_H */

arch/mn10300/include/uapi/asm/socket.h  (2 additions, 0 deletions)

@@ -102,4 +102,6 @@
 
 #define SO_PEERGROUPS		59
 
+#define SO_ZEROCOPY		60
+
 #endif /* _ASM_SOCKET_H */

arch/parisc/include/uapi/asm/socket.h  (2 additions, 0 deletions)

@@ -101,4 +101,6 @@
 
 #define SO_PEERGROUPS		0x4034
 
+#define SO_ZEROCOPY		0x4035
+
 #endif /* _UAPI_ASM_SOCKET_H */

arch/s390/include/uapi/asm/socket.h  (2 additions, 0 deletions)

@@ -108,4 +108,6 @@
 
 #define SO_PEERGROUPS		59
 
+#define SO_ZEROCOPY		60
+
 #endif /* _ASM_SOCKET_H */

arch/sparc/include/uapi/asm/socket.h  (2 additions, 0 deletions)

@@ -98,6 +98,8 @@
 
 #define SO_PEERGROUPS		0x003d
 
+#define SO_ZEROCOPY		0x003e
+
 /* Security levels - as per NRL IPv6 - don't actually do anything */
 #define SO_SECURITY_AUTHENTICATION		0x5001
 #define SO_SECURITY_ENCRYPTION_TRANSPORT	0x5002

arch/xtensa/include/uapi/asm/socket.h  (2 additions, 0 deletions)

@@ -113,4 +113,6 @@
 
 #define SO_PEERGROUPS		59
 
+#define SO_ZEROCOPY		60
+
 #endif	/* _XTENSA_SOCKET_H */

drivers/net/tun.c  (1 addition, 1 deletion)

@@ -892,7 +892,7 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
 	    sk_filter(tfile->socket.sk, skb))
 		goto drop;
 
-	if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))
+	if (unlikely(skb_orphan_frags_rx(skb, GFP_ATOMIC)))
 		goto drop;
 
 	skb_tx_timestamp(skb);

drivers/vhost/net.c  (1 addition, 0 deletions)

@@ -533,6 +533,7 @@ static void handle_tx(struct vhost_net *net)
 			ubuf->callback = vhost_zerocopy_callback;
 			ubuf->ctx = nvq->ubufs;
 			ubuf->desc = nvq->upend_idx;
+			atomic_set(&ubuf->refcnt, 1);
 			msg.msg_control = ubuf;
 			msg.msg_controllen = sizeof(ubuf);
 			ubufs = nvq->ubufs;

include/linux/sched/user.h  (2 additions, 1 deletion)

@@ -36,7 +36,8 @@ struct user_struct {
 	struct hlist_node uidhash_node;
 	kuid_t uid;
 
-#if defined(CONFIG_PERF_EVENTS) || defined(CONFIG_BPF_SYSCALL)
+#if defined(CONFIG_PERF_EVENTS) || defined(CONFIG_BPF_SYSCALL) || \
+    defined(CONFIG_NET)
 	atomic_long_t locked_vm;
 #endif
 };
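
This hunk makes user_struct::locked_vm available to networking so that
pinned zerocopy pages can be charged per user against RLIMIT_MEMLOCK, as
the Safety section of the commit message describes. As a rough
illustration, the accounting works along these lines (a simplified sketch
modeled on the mm_account_pinned_pages() helper added elsewhere in this
series; mmp_account_pages is a made-up name, not the kernel function):

/* Charge size bytes of pinned pages to the user's locked_vm, failing
 * with -ENOBUFS once the locked memory ulimit would be exceeded. */
static int mmp_account_pages(struct mmpin *mmp, size_t size)
{
	unsigned long num_pg, max_pg, old_pg, new_pg;
	struct user_struct *user = mmp->user ? : current_user();

	num_pg = (size >> PAGE_SHIFT) + 2;		/* worst case */
	max_pg = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;

	/* lock-free add to user->locked_vm, bounded by the ulimit */
	do {
		old_pg = atomic_long_read(&user->locked_vm);
		new_pg = old_pg + num_pg;
		if (new_pg > max_pg)
			return -ENOBUFS;
	} while (atomic_long_cmpxchg(&user->locked_vm, old_pg, new_pg) !=
		 old_pg);

	if (!mmp->user) {
		mmp->user = get_uid(user);
		mmp->num_pg = num_pg;
	} else {
		mmp->num_pg += num_pg;
	}
	return 0;
}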

include/linux/skbuff.h  (98 additions, 5 deletions)

@@ -429,6 +429,7 @@ enum {
 	SKBTX_SCHED_TSTAMP = 1 << 6,
 };
 
+#define SKBTX_ZEROCOPY_FRAG	(SKBTX_DEV_ZEROCOPY | SKBTX_SHARED_FRAG)
 #define SKBTX_ANY_SW_TSTAMP	(SKBTX_SW_TSTAMP    | \
 				 SKBTX_SCHED_TSTAMP)
 #define SKBTX_ANY_TSTAMP	(SKBTX_HW_TSTAMP | SKBTX_ANY_SW_TSTAMP)
@@ -443,10 +444,46 @@ enum {
  */
 struct ubuf_info {
 	void (*callback)(struct ubuf_info *, bool zerocopy_success);
-	void *ctx;
-	unsigned long desc;
+	union {
+		struct {
+			unsigned long desc;
+			void *ctx;
+		};
+		struct {
+			u32 id;
+			u16 len;
+			u16 zerocopy:1;
+			u32 bytelen;
+		};
+	};
+	atomic_t refcnt;
+
+	struct mmpin {
+		struct user_struct *user;
+		unsigned int num_pg;
+	} mmp;
 };
 
+#define skb_uarg(SKB)	((struct ubuf_info *)(skb_shinfo(SKB)->destructor_arg))
+
+struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size);
+struct ubuf_info *sock_zerocopy_realloc(struct sock *sk, size_t size,
+					struct ubuf_info *uarg);
+
+static inline void sock_zerocopy_get(struct ubuf_info *uarg)
+{
+	atomic_inc(&uarg->refcnt);
+}
+
+void sock_zerocopy_put(struct ubuf_info *uarg);
+void sock_zerocopy_put_abort(struct ubuf_info *uarg);
+
+void sock_zerocopy_callback(struct ubuf_info *uarg, bool success);
+
+int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
+			     struct msghdr *msg, int len,
+			     struct ubuf_info *uarg);
+
 /* This data is invariant across clones and lives at
  * the end of the header data, ie. at skb->end.
  */
@@ -1214,6 +1251,45 @@ static inline struct skb_shared_hwtstamps *skb_hwtstamps(struct sk_buff *skb)
 	return &skb_shinfo(skb)->hwtstamps;
 }
 
+static inline struct ubuf_info *skb_zcopy(struct sk_buff *skb)
+{
+	bool is_zcopy = skb && skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY;
+
+	return is_zcopy ? skb_uarg(skb) : NULL;
+}
+
+static inline void skb_zcopy_set(struct sk_buff *skb, struct ubuf_info *uarg)
+{
+	if (skb && uarg && !skb_zcopy(skb)) {
+		sock_zerocopy_get(uarg);
+		skb_shinfo(skb)->destructor_arg = uarg;
+		skb_shinfo(skb)->tx_flags |= SKBTX_ZEROCOPY_FRAG;
+	}
+}
+
+/* Release a reference on a zerocopy structure */
+static inline void skb_zcopy_clear(struct sk_buff *skb, bool zerocopy)
+{
+	struct ubuf_info *uarg = skb_zcopy(skb);
+
+	if (uarg) {
+		uarg->zerocopy = uarg->zerocopy && zerocopy;
+		sock_zerocopy_put(uarg);
+		skb_shinfo(skb)->tx_flags &= ~SKBTX_ZEROCOPY_FRAG;
+	}
+}
+
+/* Abort a zerocopy operation and revert zckey on error in send syscall */
+static inline void skb_zcopy_abort(struct sk_buff *skb)
+{
+	struct ubuf_info *uarg = skb_zcopy(skb);
+
+	if (uarg) {
+		sock_zerocopy_put_abort(uarg);
+		skb_shinfo(skb)->tx_flags &= ~SKBTX_ZEROCOPY_FRAG;
+	}
+}
+
 /**
  * skb_queue_empty - check if a queue is empty
  * @list: queue head
@@ -1796,13 +1872,18 @@ static inline unsigned int skb_headlen(const struct sk_buff *skb)
 	return skb->len - skb->data_len;
 }
 
-static inline unsigned int skb_pagelen(const struct sk_buff *skb)
+static inline unsigned int __skb_pagelen(const struct sk_buff *skb)
 {
 	unsigned int i, len = 0;
 
 	for (i = skb_shinfo(skb)->nr_frags - 1; (int)i >= 0; i--)
 		len += skb_frag_size(&skb_shinfo(skb)->frags[i]);
-	return len + skb_headlen(skb);
+	return len;
+}
+
+static inline unsigned int skb_pagelen(const struct sk_buff *skb)
+{
+	return skb_headlen(skb) + __skb_pagelen(skb);
 }
 
 /**
@@ -2447,7 +2528,17 @@ static inline void skb_orphan(struct sk_buff *skb)
  */
 static inline int skb_orphan_frags(struct sk_buff *skb, gfp_t gfp_mask)
 {
-	if (likely(!(skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY)))
+	if (likely(!skb_zcopy(skb)))
+		return 0;
+	if (skb_uarg(skb)->callback == sock_zerocopy_callback)
+		return 0;
+	return skb_copy_ubufs(skb, gfp_mask);
+}
+
+/* Frags must be orphaned, even if refcounted, if skb might loop to rx path */
+static inline int skb_orphan_frags_rx(struct sk_buff *skb, gfp_t gfp_mask)
+{
+	if (likely(!skb_zcopy(skb)))
 		return 0;
 	return skb_copy_ubufs(skb, gfp_mask);
 }
@@ -2879,6 +2970,8 @@ static inline int skb_add_data(struct sk_buff *skb,
 static inline bool skb_can_coalesce(struct sk_buff *skb, int i,
 				    const struct page *page, int off)
 {
+	if (skb_zcopy(skb))
+		return false;
 	if (i) {
 		const struct skb_frag_struct *frag = &skb_shinfo(skb)->frags[i - 1];
 
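Taken together, the helpers above give a protocol a simple lifecycle for
attaching pinned user pages. The following schematic shows how a stream
protocol consumes them, modeled loosely on the TCP patch in this series
(simplified and abridged; proto_sendmsg_zerocopy is a made-up name, not
the actual tcp_sendmsg change):

/* Schematic zerocopy send path for a stream protocol. */
static int proto_sendmsg_zerocopy(struct sock *sk, struct sk_buff *skb,
				  struct msghdr *msg, size_t size)
{
	struct ubuf_info *uarg;
	int copied;

	/* One notification id per send call; coalesce with the uarg
	 * already attached to the skb when possible. */
	uarg = sock_zerocopy_realloc(sk, size, skb_zcopy(skb));
	if (!uarg)
		return -ENOBUFS;

	/* Pin user pages and link them into skb frags[]; this takes
	 * its own reference on uarg via skb_zcopy_set(). */
	copied = skb_zerocopy_iter_stream(sk, skb, msg, size, uarg);

	/* Drop the allocation reference. The notification fires through
	 * sock_zerocopy_callback() once the last skb reference, clones
	 * from retransmission and GSO included, is released. */
	sock_zerocopy_put(uarg);
	return copied;
}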
include/linux/socket.h  (1 addition, 0 deletions)

@@ -287,6 +287,7 @@ struct ucred {
 #define MSG_BATCH	0x40000 /* sendmmsg(): more messages coming */
 #define MSG_EOF         MSG_FIN
 
+#define MSG_ZEROCOPY	0x4000000	/* Use user data in kernel path */
 #define MSG_FASTOPEN	0x20000000	/* Send data in TCP SYN */
 #define MSG_CMSG_CLOEXEC 0x40000000	/* Set close_on_exec for file
 					   descriptor received through

include/net/sock.h  (4 additions, 0 deletions)

@@ -294,6 +294,7 @@ struct sock_common {
  *	@sk_stamp: time stamp of last packet received
  *	@sk_tsflags: SO_TIMESTAMPING socket options
  *	@sk_tskey: counter to disambiguate concurrent tstamp requests
+ *	@sk_zckey: counter to order MSG_ZEROCOPY notifications
  *	@sk_socket: Identd and reporting IO signals
  *	@sk_user_data: RPC layer private data
  *	@sk_frag: cached page frag
@@ -462,6 +463,7 @@
 	u16			sk_tsflags;
 	u8			sk_shutdown;
 	u32			sk_tskey;
+	atomic_t		sk_zckey;
 	struct socket		*sk_socket;
 	void			*sk_user_data;
 #ifdef CONFIG_SECURITY
@@ -1531,6 +1533,8 @@ struct sk_buff *sock_wmalloc(struct sock *sk, unsigned long size, int force,
 			     gfp_t priority);
 void __sock_wfree(struct sk_buff *skb);
 void sock_wfree(struct sk_buff *skb);
+struct sk_buff *sock_omalloc(struct sock *sk, unsigned long size,
+			     gfp_t priority);
 void skb_orphan_partial(struct sk_buff *skb);
 void sock_rfree(struct sk_buff *skb);
 void sock_efree(struct sk_buff *skb);
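
These two additions meet in the notification allocator: notification skbs
are drawn from optmem via sock_omalloc() so TCP write queue accounting
(TSQ) is unaffected, and sk_zckey hands out the per-socket notification
ids reported later through ee_info/ee_data. A rough sketch of that
allocation, modeled on sock_zerocopy_alloc() in this series (abridged;
locked-page accounting and size checks omitted, not the verbatim code):

/* Sketch: allocate a zerocopy completion notification for one send. */
static struct ubuf_info *zerocopy_alloc_sketch(struct sock *sk, size_t size)
{
	struct ubuf_info *uarg;
	struct sk_buff *skb;

	/* Notification skb from optmem; may fail with -ENOBUFS when
	 * net.core.optmem_max is too small to coalesce effectively. */
	skb = sock_omalloc(sk, 0, GFP_KERNEL);
	if (!skb)
		return NULL;

	uarg = (void *)skb->cb;
	uarg->callback = sock_zerocopy_callback;
	/* Per-socket u32, incremented on each send call that transmits
	 * data; gaps would corrupt the notification chain. */
	uarg->id = ((u32)atomic_inc_return(&sk->sk_zckey)) - 1;
	uarg->len = 1;
	uarg->bytelen = size;
	uarg->zerocopy = 1;
	atomic_set(&uarg->refcnt, 1);
	sock_hold(sk);

	return uarg;
}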

include/uapi/asm-generic/socket.h  (2 additions, 0 deletions)

@@ -104,4 +104,6 @@
 
 #define SO_PEERGROUPS		59
 
+#define SO_ZEROCOPY		60
+
 #endif /* __ASM_GENERIC_SOCKET_H */

include/uapi/linux/errqueue.h  (3 additions, 0 deletions)

@@ -18,10 +18,13 @@ struct sock_extended_err {
 #define SO_EE_ORIGIN_ICMP	2
 #define SO_EE_ORIGIN_ICMP6	3
 #define SO_EE_ORIGIN_TXSTATUS	4
+#define SO_EE_ORIGIN_ZEROCOPY	5
 #define SO_EE_ORIGIN_TIMESTAMPING SO_EE_ORIGIN_TXSTATUS
 
 #define SO_EE_OFFENDER(ee)	((struct sockaddr*)((ee)+1))
 
+#define SO_EE_CODE_ZEROCOPY_COPIED	1
+
 /**
  *	struct scm_timestamping - timestamps exposed through cmsg
  *
net/core/datagram.c  (34 additions, 21 deletions)

@@ -573,27 +573,12 @@ int skb_copy_datagram_from_iter(struct sk_buff *skb, int offset,
 }
 EXPORT_SYMBOL(skb_copy_datagram_from_iter);
 
-/**
- *	zerocopy_sg_from_iter - Build a zerocopy datagram from an iov_iter
- *	@skb: buffer to copy
- *	@from: the source to copy from
- *
- *	The function will first copy up to headlen, and then pin the userspace
- *	pages and build frags through them.
- *
- *	Returns 0, -EFAULT or -EMSGSIZE.
- */
-int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *from)
+int __zerocopy_sg_from_iter(struct sock *sk, struct sk_buff *skb,
+			    struct iov_iter *from, size_t length)
 {
-	int len = iov_iter_count(from);
-	int copy = min_t(int, skb_headlen(skb), len);
-	int frag = 0;
+	int frag = skb_shinfo(skb)->nr_frags;
 
-	/* copy up to skb headlen */
-	if (skb_copy_datagram_from_iter(skb, 0, from, copy))
-		return -EFAULT;
-
-	while (iov_iter_count(from)) {
+	while (length && iov_iter_count(from)) {
 		struct page *pages[MAX_SKB_FRAGS];
 		size_t start;
 		ssize_t copied;
@@ -603,18 +588,24 @@ int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *from)
 		if (frag == MAX_SKB_FRAGS)
 			return -EMSGSIZE;
 
-		copied = iov_iter_get_pages(from, pages, ~0U,
+		copied = iov_iter_get_pages(from, pages, length,
 					    MAX_SKB_FRAGS - frag, &start);
 		if (copied < 0)
 			return -EFAULT;
 
 		iov_iter_advance(from, copied);
+		length -= copied;
 
 		truesize = PAGE_ALIGN(copied + start);
 		skb->data_len += copied;
 		skb->len += copied;
 		skb->truesize += truesize;
-		refcount_add(truesize, &skb->sk->sk_wmem_alloc);
+		if (sk && sk->sk_type == SOCK_STREAM) {
+			sk->sk_wmem_queued += truesize;
+			sk_mem_charge(sk, truesize);
+		} else {
+			refcount_add(truesize, &skb->sk->sk_wmem_alloc);
+		}
 		while (copied) {
 			int size = min_t(int, copied, PAGE_SIZE - start);
 			skb_fill_page_desc(skb, frag++, pages[n], start, size);
@@ -625,6 +616,28 @@ int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *from)
 	}
 	return 0;
 }
+EXPORT_SYMBOL(__zerocopy_sg_from_iter);
+
+/**
+ *	zerocopy_sg_from_iter - Build a zerocopy datagram from an iov_iter
+ *	@skb: buffer to copy
+ *	@from: the source to copy from
+ *
+ *	The function will first copy up to headlen, and then pin the userspace
+ *	pages and build frags through them.
+ *
+ *	Returns 0, -EFAULT or -EMSGSIZE.
+ */
+int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *from)
+{
+	int copy = min_t(int, skb_headlen(skb), iov_iter_count(from));
+
+	/* copy up to skb headlen */
+	if (skb_copy_datagram_from_iter(skb, 0, from, copy))
+		return -EFAULT;
+
+	return __zerocopy_sg_from_iter(NULL, skb, from, ~0U);
+}
 EXPORT_SYMBOL(zerocopy_sg_from_iter);
 
 static int skb_copy_and_csum_datagram(const struct sk_buff *skb, int offset,
