Skip to content

Commit 3eb39f4

Browse files
committed
signal: add pidfd_send_signal() syscall
The kill() syscall operates on process identifiers (pid). After a process has exited its pid can be reused by another process. If a caller sends a signal to a reused pid it will end up signaling the wrong process. This issue has often surfaced and there has been a push to address this problem [1]. This patch uses file descriptors (fd) from proc/<pid> as stable handles on struct pid. Even if a pid is recycled the handle will not change. The fd can be used to send signals to the process it refers to. Thus, the new syscall pidfd_send_signal() is introduced to solve this problem. Instead of pids it operates on process fds (pidfd). /* prototype and argument /* long pidfd_send_signal(int pidfd, int sig, siginfo_t *info, unsigned int flags); /* syscall number 424 */ The syscall number was chosen to be 424 to align with Arnd's rework in his y2038 to minimize merge conflicts (cf. [25]). In addition to the pidfd and signal argument it takes an additional siginfo_t and flags argument. If the siginfo_t argument is NULL then pidfd_send_signal() is equivalent to kill(<positive-pid>, <signal>). If it is not NULL pidfd_send_signal() is equivalent to rt_sigqueueinfo(). The flags argument is added to allow for future extensions of this syscall. It currently needs to be passed as 0. Failing to do so will cause EINVAL. /* pidfd_send_signal() replaces multiple pid-based syscalls */ The pidfd_send_signal() syscall currently takes on the job of rt_sigqueueinfo(2) and parts of the functionality of kill(2), Namely, when a positive pid is passed to kill(2). It will however be possible to also replace tgkill(2) and rt_tgsigqueueinfo(2) if this syscall is extended. /* sending signals to threads (tid) and process groups (pgid) */ Specifically, the pidfd_send_signal() syscall does currently not operate on process groups or threads. This is left for future extensions. In order to extend the syscall to allow sending signal to threads and process groups appropriately named flags (e.g. PIDFD_TYPE_PGID, and PIDFD_TYPE_TID) should be added. This implies that the flags argument will determine what is signaled and not the file descriptor itself. Put in other words, grouping in this api is a property of the flags argument not a property of the file descriptor (cf. [13]). Clarification for this has been requested by Eric (cf. [19]). When appropriate extensions through the flags argument are added then pidfd_send_signal() can additionally replace the part of kill(2) which operates on process groups as well as the tgkill(2) and rt_tgsigqueueinfo(2) syscalls. How such an extension could be implemented has been very roughly sketched in [14], [15], and [16]. However, this should not be taken as a commitment to a particular implementation. There might be better ways to do it. Right now this is intentionally left out to keep this patchset as simple as possible (cf. [4]). /* naming */ The syscall had various names throughout iterations of this patchset: - procfd_signal() - procfd_send_signal() - taskfd_send_signal() In the last round of reviews it was pointed out that given that if the flags argument decides the scope of the signal instead of different types of fds it might make sense to either settle for "procfd_" or "pidfd_" as prefix. The community was willing to accept either (cf. [17] and [18]). Given that one developer expressed strong preference for the "pidfd_" prefix (cf. [13]) and with other developers less opinionated about the name we should settle for "pidfd_" to avoid further bikeshedding. The "_send_signal" suffix was chosen to reflect the fact that the syscall takes on the job of multiple syscalls. It is therefore intentional that the name is not reminiscent of neither kill(2) nor rt_sigqueueinfo(2). Not the fomer because it might imply that pidfd_send_signal() is a replacement for kill(2), and not the latter because it is a hassle to remember the correct spelling - especially for non-native speakers - and because it is not descriptive enough of what the syscall actually does. The name "pidfd_send_signal" makes it very clear that its job is to send signals. /* zombies */ Zombies can be signaled just as any other process. No special error will be reported since a zombie state is an unreliable state (cf. [3]). However, this can be added as an extension through the @flags argument if the need ever arises. /* cross-namespace signals */ The patch currently enforces that the signaler and signalee either are in the same pid namespace or that the signaler's pid namespace is an ancestor of the signalee's pid namespace. This is done for the sake of simplicity and because it is unclear to what values certain members of struct siginfo_t would need to be set to (cf. [5], [6]). /* compat syscalls */ It became clear that we would like to avoid adding compat syscalls (cf. [7]). The compat syscall handling is now done in kernel/signal.c itself by adding __copy_siginfo_from_user_generic() which lets us avoid compat syscalls (cf. [8]). It should be noted that the addition of __copy_siginfo_from_user_any() is caused by a bug in the original implementation of rt_sigqueueinfo(2) (cf. 12). With upcoming rework for syscall handling things might improve significantly (cf. [11]) and __copy_siginfo_from_user_any() will not gain any additional callers. /* testing */ This patch was tested on x64 and x86. /* userspace usage */ An asciinema recording for the basic functionality can be found under [9]. With this patch a process can be killed via: #define _GNU_SOURCE #include <errno.h> #include <fcntl.h> #include <signal.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <sys/stat.h> #include <sys/syscall.h> #include <sys/types.h> #include <unistd.h> static inline int do_pidfd_send_signal(int pidfd, int sig, siginfo_t *info, unsigned int flags) { #ifdef __NR_pidfd_send_signal return syscall(__NR_pidfd_send_signal, pidfd, sig, info, flags); #else return -ENOSYS; #endif } int main(int argc, char *argv[]) { int fd, ret, saved_errno, sig; if (argc < 3) exit(EXIT_FAILURE); fd = open(argv[1], O_DIRECTORY | O_CLOEXEC); if (fd < 0) { printf("%s - Failed to open \"%s\"\n", strerror(errno), argv[1]); exit(EXIT_FAILURE); } sig = atoi(argv[2]); printf("Sending signal %d to process %s\n", sig, argv[1]); ret = do_pidfd_send_signal(fd, sig, NULL, 0); saved_errno = errno; close(fd); errno = saved_errno; if (ret < 0) { printf("%s - Failed to send signal %d to process %s\n", strerror(errno), sig, argv[1]); exit(EXIT_FAILURE); } exit(EXIT_SUCCESS); } /* Q&A * Given that it seems the same questions get asked again by people who are * late to the party it makes sense to add a Q&A section to the commit * message so it's hopefully easier to avoid duplicate threads. * * For the sake of progress please consider these arguments settled unless * there is a new point that desperately needs to be addressed. Please make * sure to check the links to the threads in this commit message whether * this has not already been covered. */ Q-01: (Florian Weimer [20], Andrew Morton [21]) What happens when the target process has exited? A-01: Sending the signal will fail with ESRCH (cf. [22]). Q-02: (Andrew Morton [21]) Is the task_struct pinned by the fd? A-02: No. A reference to struct pid is kept. struct pid - as far as I understand - was created exactly for the reason to not require to pin struct task_struct (cf. [22]). Q-03: (Andrew Morton [21]) Does the entire procfs directory remain visible? Just one entry within it? A-03: The same thing that happens right now when you hold a file descriptor to /proc/<pid> open (cf. [22]). Q-04: (Andrew Morton [21]) Does the pid remain reserved? A-04: No. This patchset guarantees a stable handle not that pids are not recycled (cf. [22]). Q-05: (Andrew Morton [21]) Do attempts to signal that fd return errors? A-05: See {Q,A}-01. Q-06: (Andrew Morton [22]) Is there a cleaner way of obtaining the fd? Another syscall perhaps. A-06: Userspace can already trivially retrieve file descriptors from procfs so this is something that we will need to support anyway. Hence, there's no immediate need to add another syscalls just to make pidfd_send_signal() not dependent on the presence of procfs. However, adding a syscalls to get such file descriptors is planned for a future patchset (cf. [22]). Q-07: (Andrew Morton [21] and others) This fd-for-a-process sounds like a handy thing and people may well think up other uses for it in the future, probably unrelated to signals. Are the code and the interface designed to permit such future applications? A-07: Yes (cf. [22]). Q-08: (Andrew Morton [21] and others) Now I think about it, why a new syscall? This thing is looking rather like an ioctl? A-08: This has been extensively discussed. It was agreed that a syscall is preferred for a variety or reasons. Here are just a few taken from prior threads. Syscalls are safer than ioctl()s especially when signaling to fds. Processes are a core kernel concept so a syscall seems more appropriate. The layout of the syscall with its four arguments would require the addition of a custom struct for the ioctl() thereby causing at least the same amount or even more complexity for userspace than a simple syscall. The new syscall will replace multiple other pid-based syscalls (see description above). The file-descriptors-for-processes concept introduced with this syscall will be extended with other syscalls in the future. See also [22], [23] and various other threads already linked in here. Q-09: (Florian Weimer [24]) What happens if you use the new interface with an O_PATH descriptor? A-09: pidfds opened as O_PATH fds cannot be used to send signals to a process (cf. [2]). Signaling processes through pidfds is the equivalent of writing to a file. Thus, this is not an operation that operates "purely at the file descriptor level" as required by the open(2) manpage. See also [4]. /* References */ [1]: https://lore.kernel.org/lkml/20181029221037.87724-1-dancol@google.com/ [2]: https://lore.kernel.org/lkml/874lbtjvtd.fsf@oldenburg2.str.redhat.com/ [3]: https://lore.kernel.org/lkml/20181204132604.aspfupwjgjx6fhva@brauner.io/ [4]: https://lore.kernel.org/lkml/20181203180224.fkvw4kajtbvru2ku@brauner.io/ [5]: https://lore.kernel.org/lkml/20181121213946.GA10795@mail.hallyn.com/ [6]: https://lore.kernel.org/lkml/20181120103111.etlqp7zop34v6nv4@brauner.io/ [7]: https://lore.kernel.org/lkml/36323361-90BD-41AF-AB5B-EE0D7BA02C21@amacapital.net/ [8]: https://lore.kernel.org/lkml/87tvjxp8pc.fsf@xmission.com/ [9]: https://asciinema.org/a/IQjuCHew6bnq1cr78yuMv16cy [11]: https://lore.kernel.org/lkml/F53D6D38-3521-4C20-9034-5AF447DF62FF@amacapital.net/ [12]: https://lore.kernel.org/lkml/87zhtjn8ck.fsf@xmission.com/ [13]: https://lore.kernel.org/lkml/871s6u9z6u.fsf@xmission.com/ [14]: https://lore.kernel.org/lkml/20181206231742.xxi4ghn24z4h2qki@brauner.io/ [15]: https://lore.kernel.org/lkml/20181207003124.GA11160@mail.hallyn.com/ [16]: https://lore.kernel.org/lkml/20181207015423.4miorx43l3qhppfz@brauner.io/ [17]: https://lore.kernel.org/lkml/CAGXu5jL8PciZAXvOvCeCU3wKUEB_dU-O3q0tDw4uB_ojMvDEew@mail.gmail.com/ [18]: https://lore.kernel.org/lkml/20181206222746.GB9224@mail.hallyn.com/ [19]: https://lore.kernel.org/lkml/20181208054059.19813-1-christian@brauner.io/ [20]: https://lore.kernel.org/lkml/8736rebl9s.fsf@oldenburg.str.redhat.com/ [21]: https://lore.kernel.org/lkml/20181228152012.dbf0508c2508138efc5f2bbe@linux-foundation.org/ [22]: https://lore.kernel.org/lkml/20181228233725.722tdfgijxcssg76@brauner.io/ [23]: https://lwn.net/Articles/773459/ [24]: https://lore.kernel.org/lkml/8736rebl9s.fsf@oldenburg.str.redhat.com/ [25]: https://lore.kernel.org/lkml/CAK8P3a0ej9NcJM8wXNPbcGUyOUZYX+VLoDFdbenW3s3114oQZw@mail.gmail.com/ Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Jann Horn <jannh@google.com> Cc: Andy Lutomirsky <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Florian Weimer <fweimer@redhat.com> Signed-off-by: Christian Brauner <christian@brauner.io> Reviewed-by: Tycho Andersen <tycho@tycho.ws> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: David Howells <dhowells@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Serge Hallyn <serge@hallyn.com> Acked-by: Aleksa Sarai <cyphar@cyphar.com>
1 parent f17b5f0 commit 3eb39f4

File tree

8 files changed

+151
-7
lines changed

8 files changed

+151
-7
lines changed

arch/x86/entry/syscalls/syscall_32.tbl

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -398,3 +398,4 @@
398398
384 i386 arch_prctl sys_arch_prctl __ia32_compat_sys_arch_prctl
399399
385 i386 io_pgetevents sys_io_pgetevents __ia32_compat_sys_io_pgetevents
400400
386 i386 rseq sys_rseq __ia32_sys_rseq
401+
424 i386 pidfd_send_signal sys_pidfd_send_signal __ia32_sys_pidfd_send_signal

arch/x86/entry/syscalls/syscall_64.tbl

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -343,6 +343,7 @@
343343
332 common statx __x64_sys_statx
344344
333 common io_pgetevents __x64_sys_io_pgetevents
345345
334 common rseq __x64_sys_rseq
346+
424 common pidfd_send_signal __x64_sys_pidfd_send_signal
346347

347348
#
348349
# x32-specific system call numbers start at 512 to avoid cache impact

fs/proc/base.c

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -3046,6 +3046,15 @@ static const struct file_operations proc_tgid_base_operations = {
30463046
.llseek = generic_file_llseek,
30473047
};
30483048

3049+
struct pid *tgid_pidfd_to_pid(const struct file *file)
3050+
{
3051+
if (!d_is_dir(file->f_path.dentry) ||
3052+
(file->f_op != &proc_tgid_base_operations))
3053+
return ERR_PTR(-EBADF);
3054+
3055+
return proc_pid(file_inode(file));
3056+
}
3057+
30493058
static struct dentry *proc_tgid_base_lookup(struct inode *dir, struct dentry *dentry, unsigned int flags)
30503059
{
30513060
return proc_pident_lookup(dir, dentry,

include/linux/proc_fs.h

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -73,6 +73,7 @@ struct proc_dir_entry *proc_create_net_single_write(const char *name, umode_t mo
7373
int (*show)(struct seq_file *, void *),
7474
proc_write_t write,
7575
void *data);
76+
extern struct pid *tgid_pidfd_to_pid(const struct file *file);
7677

7778
#else /* CONFIG_PROC_FS */
7879

@@ -114,6 +115,11 @@ static inline int remove_proc_subtree(const char *name, struct proc_dir_entry *p
114115
#define proc_create_net(name, mode, parent, state_size, ops) ({NULL;})
115116
#define proc_create_net_single(name, mode, parent, show, data) ({NULL;})
116117

118+
static inline struct pid *tgid_pidfd_to_pid(const struct file *file)
119+
{
120+
return ERR_PTR(-EBADF);
121+
}
122+
117123
#endif /* CONFIG_PROC_FS */
118124

119125
struct net;

include/linux/syscalls.h

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -926,6 +926,9 @@ asmlinkage long sys_statx(int dfd, const char __user *path, unsigned flags,
926926
unsigned mask, struct statx __user *buffer);
927927
asmlinkage long sys_rseq(struct rseq __user *rseq, uint32_t rseq_len,
928928
int flags, uint32_t sig);
929+
asmlinkage long sys_pidfd_send_signal(int pidfd, int sig,
930+
siginfo_t __user *info,
931+
unsigned int flags);
929932

930933
/*
931934
* Architecture-specific system calls

include/uapi/asm-generic/unistd.h

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -740,9 +740,11 @@ __SC_COMP(__NR_io_pgetevents, sys_io_pgetevents, compat_sys_io_pgetevents)
740740
__SYSCALL(__NR_rseq, sys_rseq)
741741
#define __NR_kexec_file_load 294
742742
__SYSCALL(__NR_kexec_file_load, sys_kexec_file_load)
743+
#define __NR_pidfd_send_signal 424
744+
__SYSCALL(__NR_pidfd_send_signal, sys_pidfd_send_signal)
743745

744746
#undef __NR_syscalls
745-
#define __NR_syscalls 295
747+
#define __NR_syscalls 425
746748

747749
/*
748750
* 32 bit systems traditionally used different

kernel/signal.c

Lines changed: 127 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -19,7 +19,9 @@
1919
#include <linux/sched/task.h>
2020
#include <linux/sched/task_stack.h>
2121
#include <linux/sched/cputime.h>
22+
#include <linux/file.h>
2223
#include <linux/fs.h>
24+
#include <linux/proc_fs.h>
2325
#include <linux/tty.h>
2426
#include <linux/binfmts.h>
2527
#include <linux/coredump.h>
@@ -3429,6 +3431,16 @@ COMPAT_SYSCALL_DEFINE4(rt_sigtimedwait, compat_sigset_t __user *, uthese,
34293431
#endif
34303432
#endif
34313433

3434+
static inline void prepare_kill_siginfo(int sig, struct kernel_siginfo *info)
3435+
{
3436+
clear_siginfo(info);
3437+
info->si_signo = sig;
3438+
info->si_errno = 0;
3439+
info->si_code = SI_USER;
3440+
info->si_pid = task_tgid_vnr(current);
3441+
info->si_uid = from_kuid_munged(current_user_ns(), current_uid());
3442+
}
3443+
34323444
/**
34333445
* sys_kill - send a signal to a process
34343446
* @pid: the PID of the process
@@ -3438,16 +3450,125 @@ SYSCALL_DEFINE2(kill, pid_t, pid, int, sig)
34383450
{
34393451
struct kernel_siginfo info;
34403452

3441-
clear_siginfo(&info);
3442-
info.si_signo = sig;
3443-
info.si_errno = 0;
3444-
info.si_code = SI_USER;
3445-
info.si_pid = task_tgid_vnr(current);
3446-
info.si_uid = from_kuid_munged(current_user_ns(), current_uid());
3453+
prepare_kill_siginfo(sig, &info);
34473454

34483455
return kill_something_info(sig, &info, pid);
34493456
}
34503457

3458+
#ifdef CONFIG_PROC_FS
3459+
/*
3460+
* Verify that the signaler and signalee either are in the same pid namespace
3461+
* or that the signaler's pid namespace is an ancestor of the signalee's pid
3462+
* namespace.
3463+
*/
3464+
static bool access_pidfd_pidns(struct pid *pid)
3465+
{
3466+
struct pid_namespace *active = task_active_pid_ns(current);
3467+
struct pid_namespace *p = ns_of_pid(pid);
3468+
3469+
for (;;) {
3470+
if (!p)
3471+
return false;
3472+
if (p == active)
3473+
break;
3474+
p = p->parent;
3475+
}
3476+
3477+
return true;
3478+
}
3479+
3480+
static int copy_siginfo_from_user_any(kernel_siginfo_t *kinfo, siginfo_t *info)
3481+
{
3482+
#ifdef CONFIG_COMPAT
3483+
/*
3484+
* Avoid hooking up compat syscalls and instead handle necessary
3485+
* conversions here. Note, this is a stop-gap measure and should not be
3486+
* considered a generic solution.
3487+
*/
3488+
if (in_compat_syscall())
3489+
return copy_siginfo_from_user32(
3490+
kinfo, (struct compat_siginfo __user *)info);
3491+
#endif
3492+
return copy_siginfo_from_user(kinfo, info);
3493+
}
3494+
3495+
/**
3496+
* sys_pidfd_send_signal - send a signal to a process through a task file
3497+
* descriptor
3498+
* @pidfd: the file descriptor of the process
3499+
* @sig: signal to be sent
3500+
* @info: the signal info
3501+
* @flags: future flags to be passed
3502+
*
3503+
* The syscall currently only signals via PIDTYPE_PID which covers
3504+
* kill(<positive-pid>, <signal>. It does not signal threads or process
3505+
* groups.
3506+
* In order to extend the syscall to threads and process groups the @flags
3507+
* argument should be used. In essence, the @flags argument will determine
3508+
* what is signaled and not the file descriptor itself. Put in other words,
3509+
* grouping is a property of the flags argument not a property of the file
3510+
* descriptor.
3511+
*
3512+
* Return: 0 on success, negative errno on failure
3513+
*/
3514+
SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
3515+
siginfo_t __user *, info, unsigned int, flags)
3516+
{
3517+
int ret;
3518+
struct fd f;
3519+
struct pid *pid;
3520+
kernel_siginfo_t kinfo;
3521+
3522+
/* Enforce flags be set to 0 until we add an extension. */
3523+
if (flags)
3524+
return -EINVAL;
3525+
3526+
f = fdget_raw(pidfd);
3527+
if (!f.file)
3528+
return -EBADF;
3529+
3530+
/* Is this a pidfd? */
3531+
pid = tgid_pidfd_to_pid(f.file);
3532+
if (IS_ERR(pid)) {
3533+
ret = PTR_ERR(pid);
3534+
goto err;
3535+
}
3536+
3537+
ret = -EINVAL;
3538+
if (!access_pidfd_pidns(pid))
3539+
goto err;
3540+
3541+
if (info) {
3542+
ret = copy_siginfo_from_user_any(&kinfo, info);
3543+
if (unlikely(ret))
3544+
goto err;
3545+
3546+
ret = -EINVAL;
3547+
if (unlikely(sig != kinfo.si_signo))
3548+
goto err;
3549+
3550+
if ((task_pid(current) != pid) &&
3551+
(kinfo.si_code >= 0 || kinfo.si_code == SI_TKILL)) {
3552+
/* Only allow sending arbitrary signals to yourself. */
3553+
ret = -EPERM;
3554+
if (kinfo.si_code != SI_USER)
3555+
goto err;
3556+
3557+
/* Turn this into a regular kill signal. */
3558+
prepare_kill_siginfo(sig, &kinfo);
3559+
}
3560+
} else {
3561+
prepare_kill_siginfo(sig, &kinfo);
3562+
}
3563+
3564+
ret = kill_pid_info(sig, &kinfo, pid);
3565+
3566+
err:
3567+
fdput(f);
3568+
return ret;
3569+
}
3570+
#endif /* CONFIG_PROC_FS */
3571+
34513572
static int
34523573
do_send_specific(pid_t tgid, pid_t pid, int sig, struct kernel_siginfo *info)
34533574
{

kernel/sys_ni.c

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -163,6 +163,7 @@ COND_SYSCALL(syslog);
163163
/* kernel/sched/core.c */
164164

165165
/* kernel/signal.c */
166+
COND_SYSCALL(pidfd_send_signal);
166167

167168
/* kernel/sys.c */
168169
COND_SYSCALL(setregid);

0 commit comments

Comments
 (0)