Commit cc8889a

wdebruij authored and davem330 committed
doc: document MSG_ZEROCOPY
Documentation for this feature was missing from the patchset.
Copied a lot from the netdev 2.1 paper, addressing some small
interface changes since then.

Changes
  v1 -> v2
    - change email discussion URL format
    - clarify that u32 counter is per-syscall, unsigned and wraps
      after UINT_MAX calls
    - describe errno on send failure specific to MSG_ZEROCOPY
    - a few very minor rewordings

Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
1 parent 9df5905 commit cc8889a

Lines changed: 257 additions & 0 deletions
@@ -0,0 +1,257 @@

============
MSG_ZEROCOPY
============

Intro
=====

The MSG_ZEROCOPY flag enables copy avoidance for socket send calls.
The feature is currently implemented for TCP sockets.


Opportunity and Caveats
-----------------------

Copying large buffers between user process and kernel can be
expensive. Linux supports various interfaces that eschew copying,
such as sendpage and splice. The MSG_ZEROCOPY flag extends the
underlying copy avoidance mechanism to common socket send calls.

Copy avoidance is not a free lunch. As implemented, with page pinning,
it replaces per-byte copy cost with page accounting and completion
notification overhead. As a result, MSG_ZEROCOPY is generally only
effective for writes larger than roughly 10 KB.

Page pinning also changes system call semantics. It temporarily shares
the buffer between process and network stack. Unlike with copying, the
process cannot immediately overwrite the buffer after the system call
returns without possibly modifying the data in flight. Kernel integrity
is not affected, but a buggy program can possibly corrupt its own data
stream.

The kernel returns a notification when it is safe to modify data.
Converting an existing application to MSG_ZEROCOPY is therefore not
always as trivial as just passing the flag.


More Info
---------

Much of this document was derived from a longer paper presented at
netdev 2.1. For more in-depth information, see that paper and talk,
the excellent reporting over at LWN.net, or read the original code.

  paper, slides, video
    https://netdevconf.org/2.1/session.html?debruijn

  LWN article
    https://lwn.net/Articles/726917/

  patchset
    [PATCH net-next v4 0/9] socket sendmsg MSG_ZEROCOPY
    http://lkml.kernel.org/r/20170803202945.70750-1-willemdebruijn.kernel@gmail.com


Interface
=========

Passing the MSG_ZEROCOPY flag is the most obvious step to enable copy
avoidance, but not the only one.

Socket Setup
------------

The kernel is permissive when applications pass undefined flags to the
send system call. By default it simply ignores these. To avoid enabling
copy avoidance mode for legacy processes that accidentally already pass
this flag, a process must first signal intent by setting a socket option:

::

	int one = 1;	/* non-zero enables the option */

	if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)))
		error(1, errno, "setsockopt zerocopy");
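
The setup step can also be wrapped in a helper that reports failure
to the caller instead of exiting. The following is a minimal sketch,
not code from the patchset: the name enable_zerocopy() is hypothetical,
and the fallback value 60 for SO_ZEROCOPY is an assumption that holds
only on architectures using the generic Linux socket option numbering.

```c
#include <errno.h>
#include <sys/socket.h>

/* Older libc headers may not define SO_ZEROCOPY. The value 60 is an
 * assumption valid for architectures using asm-generic/socket.h. */
#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60
#endif

/* Hypothetical helper: returns 0 on success or a negative errno value. */
static int enable_zerocopy(int fd)
{
	int one = 1;

	if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) == -1)
		return -errno;
	return 0;
}
```

A caller that gets a negative return value can simply keep sending
without MSG_ZEROCOPY.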


Transmission
------------

The change to send (or sendto, sendmsg, sendmmsg) itself is trivial.
Pass the new flag.

::

	ret = send(fd, buf, sizeof(buf), MSG_ZEROCOPY);

If zerocopy setup fails, the call returns -1 with errno set to ENOBUFS.
This happens if the socket option was not set, the socket exceeds its
optmem limit or the user exceeds its ulimit on locked pages.
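
One possible way to handle ENOBUFS is to retry the same buffer as an
ordinary copying send. The sketch below assumes that policy; it is not
prescribed by the kernel. The helper name send_zerocopy_or_copy() and
the used_zerocopy output parameter are hypothetical. The caller needs
that output to know whether a completion notification will follow and
whether the buffer must be left untouched until then.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Fallback for older libc headers; matches the Linux UAPI value. */
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000
#endif

/* Hypothetical helper: try a zerocopy send and fall back to a copying
 * send on ENOBUFS. *used_zerocopy is set to true only if the zerocopy
 * send succeeded and sent data, i.e. a notification is expected. */
static ssize_t send_zerocopy_or_copy(int fd, const void *buf, size_t len,
				     bool *used_zerocopy)
{
	ssize_t ret = send(fd, buf, len, MSG_ZEROCOPY);

	if (ret == -1 && errno == ENOBUFS) {
		*used_zerocopy = false;
		return send(fd, buf, len, 0);	/* plain copying send */
	}

	*used_zerocopy = ret > 0;
	return ret;
}
```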


Mixing copy avoidance and copying
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Many workloads have a mixture of large and small buffers. Because copy
avoidance is more expensive than copying for small packets, the
feature is implemented as a flag. It is safe to mix calls that pass the
flag with calls that do not.
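
Choosing the flag per call can be as simple as a size check. The
sketch below is illustrative only: send_flags_for() and ZC_THRESHOLD
are hypothetical names, and the 10 KB cut-over is taken from the rough
figure given earlier in this document. The real break-even point
depends on the workload and hardware and should be measured.

```c
#include <stddef.h>
#include <sys/socket.h>

/* Fallback for older libc headers; matches the Linux UAPI value. */
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000
#endif

/* Hypothetical cut-over size; an assumption, tune by measurement. */
#define ZC_THRESHOLD (10 * 1024)

/* Returns the send flags to use for a buffer of the given length. */
static int send_flags_for(size_t len)
{
	return len >= ZC_THRESHOLD ? MSG_ZEROCOPY : 0;
}
```

A call site would then read ``send(fd, buf, len, send_flags_for(len))``.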


Notifications
-------------

The kernel has to notify the process when it is safe to reuse a
previously passed buffer. It queues completion notifications on the
socket error queue, akin to the transmit timestamping interface.

The notification itself is a simple scalar value. Each socket
maintains an internal unsigned 32-bit counter. Each send call with
MSG_ZEROCOPY that successfully sends data increments the counter. The
counter is not incremented on failure or if called with length zero.
The counter counts system call invocations, not bytes. It wraps after
UINT_MAX calls.
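
A process can mirror this counter in user space to associate each
buffer with the notification value that will release it. The sketch
below is a user-space model, not kernel code. The names zc_tracker and
zc_track_send() are hypothetical, and it assumes that the first
counted call on a socket is reported as value 0.

```c
#include <stdint.h>
#include <sys/types.h>

/* Hypothetical user-space mirror of the per-socket counter. */
struct zc_tracker {
	uint32_t next_id;	/* assumed to start at 0 on a new socket */
};

/* Call with the return value of a send that passed MSG_ZEROCOPY.
 * Returns 1 and stores the expected notification value in *id if the
 * call counts (it sent data), or 0 on failure or zero length. */
static int zc_track_send(struct zc_tracker *t, ssize_t ret, uint32_t *id)
{
	if (ret <= 0)
		return 0;

	*id = t->next_id++;	/* unsigned arithmetic wraps to 0 */
	return 1;
}
```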


Notification Reception
~~~~~~~~~~~~~~~~~~~~~~

The below snippet demonstrates the API. In the simplest case, each
send syscall is followed by a poll and recvmsg on the error queue.

Reading from the error queue is always a non-blocking operation. The
poll call is there to block until an error is outstanding. It will set
POLLERR in its output flags. That flag does not have to be set in the
events field. Errors are signaled unconditionally.

::

	pfd.fd = fd;
	pfd.events = 0;
	/* parenthesize the mask test: == binds tighter than & */
	if (poll(&pfd, 1, -1) != 1 || (pfd.revents & POLLERR) == 0)
		error(1, errno, "poll");

	ret = recvmsg(fd, &msg, MSG_ERRQUEUE);
	if (ret == -1)
		error(1, errno, "recvmsg");

	read_notification(&msg);

The example is for demonstration purposes only. In practice, it is more
efficient to not wait for notifications, but read without blocking
every couple of send calls.
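
The non-blocking variant can be sketched as a loop that empties the
error queue and stops when nothing is left. This is a minimal sketch
under stated assumptions: drain_error_queue() and its handle callback
are hypothetical names, and the 128-byte control buffer is an assumed
size that is large enough for one sock_extended_err control message.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: read all queued notifications without blocking.
 * Calls handle() (if not NULL) for each message. Returns the number of
 * messages read, or -1 on an error other than an empty queue. */
static int drain_error_queue(int fd, void (*handle)(struct msghdr *msg))
{
	/* The union keeps the buffer aligned for struct cmsghdr. */
	union {
		char buf[128];
		struct cmsghdr align;
	} control;
	int count = 0;

	for (;;) {
		struct msghdr msg;

		memset(&msg, 0, sizeof(msg));
		msg.msg_control = control.buf;
		msg.msg_controllen = sizeof(control.buf);

		if (recvmsg(fd, &msg, MSG_ERRQUEUE) == -1) {
			if (errno == EINTR)
				continue;
			if (errno == EAGAIN || errno == EWOULDBLOCK)
				return count;	/* queue is empty */
			return -1;
		}

		if (handle)
			handle(&msg);
		count++;
	}
}
```

A sender might call this after every few send calls and release the
buffers covered by the ranges that the callback observes.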

Notifications can be processed out of order with other operations on
the socket. A socket that has an error queued would normally block
other operations until the error is read. Zerocopy notifications have
a zero error code, however, so that they do not block send and recv
calls.


Notification Batching
~~~~~~~~~~~~~~~~~~~~~

Multiple outstanding packets can be read at once using the recvmmsg
call. This is often not needed. In each message the kernel returns not
a single value, but a range. It coalesces consecutive notifications
while one is outstanding for reception on the error queue.

When a new notification is about to be queued, the kernel checks
whether the new value extends the range of the notification at the
tail of the queue. If so, it drops the new notification packet and
instead increases the range upper value of the outstanding
notification.

For protocols that acknowledge data in-order, like TCP, each
notification can be squashed into the previous one, so that no more
than one notification is outstanding at any one point.

Ordered delivery is the common case, but not guaranteed. Notifications
may arrive out of order on retransmission and socket teardown.
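
The coalescing rule can be modeled in a few lines. The code below is a
simplified user-space model of the behavior described above, not the
kernel implementation. The names zc_range and zc_range_try_extend()
are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Inclusive range of completed send calls, as carried by one
 * notification. */
struct zc_range {
	uint32_t lo;
	uint32_t hi;
};

/* If the new range [lo, hi] directly follows the tail range, extend
 * the tail and return true: no new notification needs to be queued.
 * Otherwise return false: the new range needs its own notification. */
static bool zc_range_try_extend(struct zc_range *tail,
				uint32_t lo, uint32_t hi)
{
	if (lo != (uint32_t)(tail->hi + 1))	/* wraps like the counter */
		return false;

	tail->hi = hi;
	return true;
}
```

With in-order completions every new value extends the tail, which is
why a TCP socket usually has at most one notification queued.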


Notification Parsing
~~~~~~~~~~~~~~~~~~~~

The below snippet demonstrates how to parse the control message: the
read_notification() call in the previous snippet. A notification
is encoded in the standard error format, sock_extended_err.

The level and type fields in the control data are protocol family
specific: SOL_IP with IP_RECVERR for IPv4, or SOL_IPV6 with
IPV6_RECVERR for IPv6.

Error origin is the new type SO_EE_ORIGIN_ZEROCOPY. ee_errno is zero,
as explained before, to avoid blocking read and write system calls on
the socket.

The 32-bit notification range is encoded as [ee_info, ee_data]. This
range is inclusive. Other fields in the struct must be treated as
undefined, except for ee_code, as discussed below.

::

	/* msg points to the struct msghdr filled in by recvmsg */
	struct sock_extended_err *serr;
	struct cmsghdr *cm;

	cm = CMSG_FIRSTHDR(msg);
	/* reject if there is no header or if either field mismatches */
	if (!cm || cm->cmsg_level != SOL_IP ||
	    cm->cmsg_type != IP_RECVERR)
		error(1, 0, "cmsg");

	serr = (void *) CMSG_DATA(cm);
	if (serr->ee_errno != 0 ||
	    serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
		error(1, 0, "serr");

	printf("completed: %u..%u\n", serr->ee_info, serr->ee_data);


Deferred copies
~~~~~~~~~~~~~~~

Passing flag MSG_ZEROCOPY is a hint to the kernel to apply copy
avoidance, and a contract that the kernel will queue a completion
notification. It is not a guarantee that the copy is elided.

Copy avoidance is not always feasible. Devices that do not support
scatter-gather I/O cannot send packets made up of kernel generated
protocol headers plus zerocopy user data. A packet may need to be
converted to a private copy of data deep in the stack, say to compute
a checksum.

In all these cases, the kernel returns a completion notification when
it releases its hold on the shared pages. That notification may arrive
before the (copied) data is fully transmitted. A zerocopy completion
notification is therefore not a transmit completion notification.

Deferred copies can be more expensive than a copy immediately in the
system call, if the data is no longer warm in the cache. The process
also incurs notification processing cost for no benefit. For this
reason, the kernel signals if data was completed with a copy, by
setting flag SO_EE_CODE_ZEROCOPY_COPIED in field ee_code on return.
A process may use this signal to stop passing flag MSG_ZEROCOPY on
subsequent requests on the same socket.


Implementation
==============

Loopback
--------

Data sent to local sockets can be queued indefinitely if the receive
process does not read its socket. Unbounded notification latency is not
acceptable. For this reason all packets generated with MSG_ZEROCOPY
that are looped to a local socket will incur a deferred copy. This
includes looping onto packet sockets (e.g., tcpdump) and tun devices.


Testing
=======

More realistic example code can be found in the kernel source under
tools/testing/selftests/net/msg_zerocopy.c.

Be cognizant of the loopback constraint. The test can be run between
a pair of hosts. But if run between a local pair of processes, for
instance when run with msg_zerocopy.sh between a veth pair across
namespaces, the test will not show any improvement. For testing, the
loopback restriction can be temporarily relaxed by making
skb_orphan_frags_rx identical to skb_orphan_frags.
