Commit f1ef09f

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull namespace updates from Eric Biederman:

"There is a lot here. A lot of these changes result in subtle user visible differences in kernel behavior. I don't expect anything will care but I will revert/fix things immediately if any regressions show up.

From Seth Forshee there is a continuation of the work to make the vfs ready for unprivileged mounts. We had thought the previous changes prevented the creation of files outside of s_user_ns of a filesystem, but it turns out we missed the O_CREAT path. Ooops.

Pavel Tikhomirov and Oleg Nesterov worked together to fix a long standing bug in the implementation of PR_SET_CHILD_SUBREAPER where only children that are forked after the prctl are considered and not children forked before the prctl. The only known user of this prctl, systemd, forks all children after the prctl, so no userspace regressions will occur. Holding earlier forked children to the same rules as later forked children creates a semantic that is sane enough to allow checkpointing of processes that use this feature.

There is a long delayed change by Nikolay Borisov to limit inotify instances inside a user namespace.

Michael Kerrisk extends the API for files used to manipulate namespaces with two new trivial ioctls to allow discovery of the hierarchy and properties of namespaces.

Konstantin Khlebnikov with the help of Al Viro adds code that, when a network namespace exits, purges its sysctl entries from the dcache, as in some circumstances this could use a lot of memory.

Vivek Goyal fixed a bug with stacked filesystems where the permissions on the wrong inode were being checked.

I continue previous work on ptracing across exec, allowing a file to be setuid across exec while being ptraced if the tracer has enough credentials in the user namespace, and if the process has CAP_SETUID in its own namespace. Proc files for setuid or otherwise undumpable executables are now owned by the root in the user namespace of their mm, allowing debugging of setuid applications in containers to work better.

A bug I introduced with permission checking and automount is now fixed. The big change is to mark the mounts that the kernel initiates as a result of an automount. This allows the permission checks in sget to be safely suppressed for this kind of mount, as the permission check happened when the original filesystem was mounted.

Finally a special case in the mount namespace is removed, preventing unbounded chains in the mount hash table and making the semantics simpler, which benefits CRIU.

The vfs fix along with related work in ima and evm I believe makes us ready to finish developing and merge fully unprivileged mounts of the fuse filesystem.

The cleanups of the mount namespace make it possible to discuss how to fix the worst case complexity of umount.

The stacked filesystem fixes pave the way for adding multiple mappings for the filesystem uids so that efficient and safer containers can be implemented"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  proc/sysctl: Don't grab i_lock under sysctl_lock.
  vfs: Use upper filesystem inode in bprm_fill_uid()
  proc/sysctl: prune stale dentries during unregistering
  mnt: Tuck mounts under others instead of creating shadow/side mounts.
  prctl: propagate has_child_subreaper flag to every descendant
  introduce the walk_process_tree() helper
  nsfs: Add an ioctl() to return owner UID of a userns
  fs: Better permission checking for submounts
  exit: fix the setns() && PR_SET_CHILD_SUBREAPER interaction
  vfs: open() with O_CREAT should not create inodes with unknown ids
  nsfs: Add an ioctl() to return the namespace type
  proc: Better ownership of files for non-dumpable tasks in user namespaces
  exec: Remove LSM_UNSAFE_PTRACE_CAP
  exec: Test the ptracer's saved cred to see if the tracee can gain caps
  exec: Don't reset euid and egid when the tracee has CAP_SETUID
  inotify: Convert to using per-namespace limits
2 parents ef96152 + ace0c79 commit f1ef09f
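The PR_SET_CHILD_SUBREAPER fix described in the merge message can be pictured with a small userspace model. This is only a sketch: `struct task`, `add_child()`, `propagate()` and `set_child_subreaper()` are hypothetical names, the tree is a fixed-size array rather than the kernel's task lists, and the recursion stands in for the `walk_process_tree()` helper named in the shortlog. It shows the intended semantic, where descendants forked before the prctl end up with the same flag as descendants forked after it.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_CHILDREN 4

/* Hypothetical stand-in for the per-process state the kernel tracks. */
struct task {
	bool is_child_subreaper;   /* this task asked to be a subreaper */
	bool has_child_subreaper;  /* some ancestor is a subreaper */
	struct task *children[MAX_CHILDREN];
	size_t nr_children;
};

/* Model of fork(): the child inherits the flag from its parent. */
static void add_child(struct task *parent, struct task *child)
{
	if (parent->nr_children < MAX_CHILDREN)
		parent->children[parent->nr_children++] = child;
	child->has_child_subreaper =
		parent->has_child_subreaper || parent->is_child_subreaper;
}

/* Mark every existing descendant, not only tasks forked later. */
static void propagate(struct task *t)
{
	for (size_t i = 0; i < t->nr_children; i++) {
		t->children[i]->has_child_subreaper = true;
		propagate(t->children[i]);
	}
}

/* Model of prctl(PR_SET_CHILD_SUBREAPER, 1) with the fixed semantic. */
static void set_child_subreaper(struct task *t)
{
	t->is_child_subreaper = true;
	propagate(t);
}
```

With this model, a grandchild created before `set_child_subreaper()` is called on its grandparent has `has_child_subreaper` set afterwards, which is the behavior the merge message says the old code lacked.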

40 files changed, +431 -226 lines changed

fs/afs/mntpt.c

Lines changed: 1 addition & 1 deletion
@@ -202,7 +202,7 @@ static struct vfsmount *afs_mntpt_do_automount(struct dentry *mntpt)
 
 	/* try and do the mount */
 	_debug("--- attempting mount %s -o %s ---", devname, options);
-	mnt = vfs_kern_mount(&afs_fs_type, 0, devname, options);
+	mnt = vfs_submount(mntpt, &afs_fs_type, devname, options);
 	_debug("--- mount result %p ---", mnt);
 
 	free_page((unsigned long) devname);
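The call above is the first of several in this commit that switch an automount from `vfs_kern_mount()` to `vfs_submount()`, which the fs/namespace.c hunk defines. The following userspace sketch models the two rules visible in that diff: a submount is refused with `-EPERM` unless the mountpoint's superblock belongs to the initial user namespace, and the internal submount flag is stripped from flags supplied through `do_mount()`. Every type and name here (`struct user_ns`, `submount_check()`, `sanitize_user_flags()`, `MS_SUBMOUNT_MODEL`) is a hypothetical stand-in, not kernel API.

```c
#include <errno.h>
#include <stddef.h>

/* Illustrative bit for the kernel-internal "this is a submount" flag. */
#define MS_SUBMOUNT_MODEL (1u << 26)

/* Hypothetical miniature versions of the kernel structures involved. */
struct user_ns { int id; };
struct super_block { const struct user_ns *s_user_ns; };
struct dentry { const struct super_block *d_sb; };

/* Stand-in for the kernel's init_user_ns. */
static const struct user_ns init_user_ns_model = { 0 };

/*
 * Model of the check in vfs_submount(): only mountpoints on filesystems
 * owned by the initial user namespace may trigger a submount. On success
 * the flags for the kernel-initiated mount are stored in *flags_out.
 */
static int submount_check(const struct dentry *mountpoint,
			  unsigned int *flags_out)
{
	if (mountpoint->d_sb->s_user_ns != &init_user_ns_model)
		return -EPERM;
	*flags_out = MS_SUBMOUNT_MODEL;
	return 0;
}

/* Model of do_mount() masking: callers cannot set the internal flag. */
static unsigned int sanitize_user_flags(unsigned int flags)
{
	return flags & ~MS_SUBMOUNT_MODEL;
}
```

Keeping the flag out of reach of `mount(2)` callers is what lets the kernel treat it as proof that the mount was initiated by an automount rather than requested directly.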

fs/autofs4/waitq.c

Lines changed: 2 additions & 2 deletions
@@ -436,8 +436,8 @@ int autofs4_wait(struct autofs_sb_info *sbi,
 		memcpy(&wq->name, &qstr, sizeof(struct qstr));
 		wq->dev = autofs4_get_dev(sbi);
 		wq->ino = autofs4_get_ino(sbi);
-		wq->uid = current_real_cred()->uid;
-		wq->gid = current_real_cred()->gid;
+		wq->uid = current_cred()->uid;
+		wq->gid = current_cred()->gid;
 		wq->pid = pid;
 		wq->tgid = tgid;
 		wq->status = -EINTR; /* Status return if interrupted */

fs/cifs/cifs_dfs_ref.c

Lines changed: 4 additions & 3 deletions
@@ -245,7 +245,8 @@ char *cifs_compose_mount_options(const char *sb_mountdata,
  * @fullpath: full path in UNC format
  * @ref: server's referral
  */
-static struct vfsmount *cifs_dfs_do_refmount(struct cifs_sb_info *cifs_sb,
+static struct vfsmount *cifs_dfs_do_refmount(struct dentry *mntpt,
+		struct cifs_sb_info *cifs_sb,
 		const char *fullpath, const struct dfs_info3_param *ref)
 {
 	struct vfsmount *mnt;
@@ -259,7 +260,7 @@ static struct vfsmount *cifs_dfs_do_refmount(struct cifs_sb_info *cifs_sb,
 	if (IS_ERR(mountdata))
 		return (struct vfsmount *)mountdata;
 
-	mnt = vfs_kern_mount(&cifs_fs_type, 0, devname, mountdata);
+	mnt = vfs_submount(mntpt, &cifs_fs_type, devname, mountdata);
 	kfree(mountdata);
 	kfree(devname);
 	return mnt;
@@ -334,7 +335,7 @@ static struct vfsmount *cifs_dfs_do_automount(struct dentry *mntpt)
 			mnt = ERR_PTR(-EINVAL);
 			break;
 		}
-		mnt = cifs_dfs_do_refmount(cifs_sb,
+		mnt = cifs_dfs_do_refmount(mntpt, cifs_sb,
 				full_path, referrals + i);
 		cifs_dbg(FYI, "%s: cifs_dfs_do_refmount:%s , mnt:%p\n",
 			 __func__, referrals[i].node_name, mnt);

fs/debugfs/inode.c

Lines changed: 4 additions & 4 deletions
@@ -187,9 +187,9 @@ static const struct super_operations debugfs_super_operations = {
 
 static struct vfsmount *debugfs_automount(struct path *path)
 {
-	struct vfsmount *(*f)(void *);
-	f = (struct vfsmount *(*)(void *))path->dentry->d_fsdata;
-	return f(d_inode(path->dentry)->i_private);
+	debugfs_automount_t f;
+	f = (debugfs_automount_t)path->dentry->d_fsdata;
+	return f(path->dentry, d_inode(path->dentry)->i_private);
 }
 
 static const struct dentry_operations debugfs_dops = {
@@ -540,7 +540,7 @@ EXPORT_SYMBOL_GPL(debugfs_create_dir);
  */
 struct dentry *debugfs_create_automount(const char *name,
 					struct dentry *parent,
-					struct vfsmount *(*f)(void *),
+					debugfs_automount_t f,
 					void *data)
 {
 	struct dentry *dentry = start_creating(name, parent);

fs/exec.c

Lines changed: 3 additions & 7 deletions
@@ -1426,12 +1426,8 @@ static void check_unsafe_exec(struct linux_binprm *bprm)
 	struct task_struct *p = current, *t;
 	unsigned n_fs;
 
-	if (p->ptrace) {
-		if (ptracer_capable(p, current_user_ns()))
-			bprm->unsafe |= LSM_UNSAFE_PTRACE_CAP;
-		else
-			bprm->unsafe |= LSM_UNSAFE_PTRACE;
-	}
+	if (p->ptrace)
+		bprm->unsafe |= LSM_UNSAFE_PTRACE;
 
 	/*
 	 * This isn't strictly necessary, but it makes it harder for LSMs to
@@ -1479,7 +1475,7 @@ static void bprm_fill_uid(struct linux_binprm *bprm)
 	if (task_no_new_privs(current))
 		return;
 
-	inode = file_inode(bprm->file);
+	inode = bprm->file->f_path.dentry->d_inode;
 	mode = READ_ONCE(inode->i_mode);
 	if (!(mode & (S_ISUID|S_ISGID)))
 		return;

fs/mount.h

Lines changed: 0 additions & 1 deletion
@@ -89,7 +89,6 @@ static inline int is_mounted(struct vfsmount *mnt)
 }
 
 extern struct mount *__lookup_mnt(struct vfsmount *, struct dentry *);
-extern struct mount *__lookup_mnt_last(struct vfsmount *, struct dentry *);
 
 extern int __legitimize_mnt(struct vfsmount *, unsigned);
 extern bool legitimize_mnt(struct vfsmount *, unsigned);

fs/namei.c

Lines changed: 6 additions & 3 deletions
@@ -1100,7 +1100,6 @@ static int follow_automount(struct path *path, struct nameidata *nd,
 			    bool *need_mntput)
 {
 	struct vfsmount *mnt;
-	const struct cred *old_cred;
 	int err;
 
 	if (!path->dentry->d_op || !path->dentry->d_op->d_automount)
@@ -1129,9 +1128,7 @@ static int follow_automount(struct path *path, struct nameidata *nd,
 	if (nd->total_link_count >= 40)
 		return -ELOOP;
 
-	old_cred = override_creds(&init_cred);
 	mnt = path->dentry->d_op->d_automount(path);
-	revert_creds(old_cred);
 	if (IS_ERR(mnt)) {
 		/*
 		 * The filesystem is allowed to return -EISDIR here to indicate
@@ -2941,10 +2938,16 @@ static inline int open_to_namei_flags(int flag)
 
 static int may_o_create(const struct path *dir, struct dentry *dentry, umode_t mode)
 {
+	struct user_namespace *s_user_ns;
 	int error = security_path_mknod(dir, dentry, mode, 0);
 	if (error)
 		return error;
 
+	s_user_ns = dir->dentry->d_sb->s_user_ns;
+	if (!kuid_has_mapping(s_user_ns, current_fsuid()) ||
+	    !kgid_has_mapping(s_user_ns, current_fsgid()))
+		return -EOVERFLOW;
+
 	error = inode_permission(dir->dentry->d_inode, MAY_WRITE | MAY_EXEC);
 	if (error)
 		return error;
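The check added to `may_o_create()` refuses to create a file when the caller's fsuid or fsgid has no mapping in the filesystem's `s_user_ns`. A rough userspace model of that test follows. It assumes an ID map is a list of extents, each translating a range of namespace IDs onto a range of kernel-global IDs, which is how user namespace ID maps are commonly described; `struct id_extent`, `struct id_map`, `struct user_ns_model`, `id_has_mapping()` and `may_o_create_model()` are hypothetical names for illustration only.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One extent: namespace IDs [first, first + count) map onto
 * kernel-global IDs [lower_first, lower_first + count). */
struct id_extent {
	uint32_t first;
	uint32_t lower_first;
	uint32_t count;
};

struct id_map {
	const struct id_extent *extents;
	size_t nr;
};

/* Hypothetical miniature of a user namespace: one map per ID kind. */
struct user_ns_model {
	struct id_map uid_map;
	struct id_map gid_map;
};

/* True if the kernel-global ID is representable inside the namespace. */
static bool id_has_mapping(const struct id_map *map, uint32_t kid)
{
	for (size_t i = 0; i < map->nr; i++) {
		const struct id_extent *e = &map->extents[i];

		if (kid >= e->lower_first && kid - e->lower_first < e->count)
			return true;
	}
	return false;
}

/* Model of the new check: unmapped creators get -EOVERFLOW. */
static int may_o_create_model(const struct user_ns_model *s_user_ns,
			      uint32_t fsuid, uint32_t fsgid)
{
	if (!id_has_mapping(&s_user_ns->uid_map, fsuid) ||
	    !id_has_mapping(&s_user_ns->gid_map, fsgid))
		return -EOVERFLOW;
	return 0;
}
```

In this model a filesystem whose namespace maps 65536 IDs starting at kernel ID 100000 accepts a creator with fsuid 100500 and rejects one with fsuid 0, since the new inode's owner could not be expressed in that filesystem's namespace.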

fs/namespace.c

Lines changed: 76 additions & 51 deletions
@@ -636,28 +636,6 @@ struct mount *__lookup_mnt(struct vfsmount *mnt, struct dentry *dentry)
 	return NULL;
 }
 
-/*
- * find the last mount at @dentry on vfsmount @mnt.
- * mount_lock must be held.
- */
-struct mount *__lookup_mnt_last(struct vfsmount *mnt, struct dentry *dentry)
-{
-	struct mount *p, *res = NULL;
-	p = __lookup_mnt(mnt, dentry);
-	if (!p)
-		goto out;
-	if (!(p->mnt.mnt_flags & MNT_UMOUNT))
-		res = p;
-	hlist_for_each_entry_continue(p, mnt_hash) {
-		if (&p->mnt_parent->mnt != mnt || p->mnt_mountpoint != dentry)
-			break;
-		if (!(p->mnt.mnt_flags & MNT_UMOUNT))
-			res = p;
-	}
-out:
-	return res;
-}
-
 /*
  * lookup_mnt - Return the first child mount mounted at path
  *
@@ -878,6 +856,13 @@ void mnt_set_mountpoint(struct mount *mnt,
 	hlist_add_head(&child_mnt->mnt_mp_list, &mp->m_list);
 }
 
+static void __attach_mnt(struct mount *mnt, struct mount *parent)
+{
+	hlist_add_head_rcu(&mnt->mnt_hash,
+			   m_hash(&parent->mnt, mnt->mnt_mountpoint));
+	list_add_tail(&mnt->mnt_child, &parent->mnt_mounts);
+}
+
 /*
  * vfsmount lock must be held for write
  */
@@ -886,28 +871,45 @@ static void attach_mnt(struct mount *mnt,
 			struct mountpoint *mp)
 {
 	mnt_set_mountpoint(parent, mp, mnt);
-	hlist_add_head_rcu(&mnt->mnt_hash, m_hash(&parent->mnt, mp->m_dentry));
-	list_add_tail(&mnt->mnt_child, &parent->mnt_mounts);
+	__attach_mnt(mnt, parent);
 }
 
-static void attach_shadowed(struct mount *mnt,
-			struct mount *parent,
-			struct mount *shadows)
+void mnt_change_mountpoint(struct mount *parent, struct mountpoint *mp, struct mount *mnt)
 {
-	if (shadows) {
-		hlist_add_behind_rcu(&mnt->mnt_hash, &shadows->mnt_hash);
-		list_add(&mnt->mnt_child, &shadows->mnt_child);
-	} else {
-		hlist_add_head_rcu(&mnt->mnt_hash,
-				m_hash(&parent->mnt, mnt->mnt_mountpoint));
-		list_add_tail(&mnt->mnt_child, &parent->mnt_mounts);
-	}
+	struct mountpoint *old_mp = mnt->mnt_mp;
+	struct dentry *old_mountpoint = mnt->mnt_mountpoint;
+	struct mount *old_parent = mnt->mnt_parent;
+
+	list_del_init(&mnt->mnt_child);
+	hlist_del_init(&mnt->mnt_mp_list);
+	hlist_del_init_rcu(&mnt->mnt_hash);
+
+	attach_mnt(mnt, parent, mp);
+
+	put_mountpoint(old_mp);
+
+	/*
+	 * Safely avoid even the suggestion this code might sleep or
+	 * lock the mount hash by taking advantage of the knowledge that
+	 * mnt_change_mountpoint will not release the final reference
+	 * to a mountpoint.
+	 *
+	 * During mounting, the mount passed in as the parent mount will
+	 * continue to use the old mountpoint and during unmounting, the
+	 * old mountpoint will continue to exist until namespace_unlock,
+	 * which happens well after mnt_change_mountpoint.
+	 */
+	spin_lock(&old_mountpoint->d_lock);
+	old_mountpoint->d_lockref.count--;
+	spin_unlock(&old_mountpoint->d_lock);
+
+	mnt_add_count(old_parent, -1);
 }
 
 /*
  * vfsmount lock must be held for write
  */
-static void commit_tree(struct mount *mnt, struct mount *shadows)
+static void commit_tree(struct mount *mnt)
 {
 	struct mount *parent = mnt->mnt_parent;
 	struct mount *m;
@@ -925,7 +927,7 @@ static void commit_tree(struct mount *mnt, struct mount *shadows)
 	n->mounts += n->pending_mounts;
 	n->pending_mounts = 0;
 
-	attach_shadowed(mnt, parent, shadows);
+	__attach_mnt(mnt, parent);
 	touch_mnt_namespace(n);
 }
 
@@ -989,6 +991,21 @@ vfs_kern_mount(struct file_system_type *type, int flags, const char *name, void
 }
 EXPORT_SYMBOL_GPL(vfs_kern_mount);
 
+struct vfsmount *
+vfs_submount(const struct dentry *mountpoint, struct file_system_type *type,
+	     const char *name, void *data)
+{
+	/* Until it is worked out how to pass the user namespace
+	 * through from the parent mount to the submount don't support
+	 * unprivileged mounts with submounts.
+	 */
+	if (mountpoint->d_sb->s_user_ns != &init_user_ns)
+		return ERR_PTR(-EPERM);
+
+	return vfs_kern_mount(type, MS_SUBMOUNT, name, data);
+}
+EXPORT_SYMBOL_GPL(vfs_submount);
+
 static struct mount *clone_mnt(struct mount *old, struct dentry *root,
 					int flag)
 {
@@ -1764,7 +1781,6 @@ struct mount *copy_tree(struct mount *mnt, struct dentry *dentry,
 			continue;
 
 		for (s = r; s; s = next_mnt(s, r)) {
-			struct mount *t = NULL;
 			if (!(flag & CL_COPY_UNBINDABLE) &&
 			    IS_MNT_UNBINDABLE(s)) {
 				s = skip_mnt_tree(s);
@@ -1786,14 +1802,7 @@ struct mount *copy_tree(struct mount *mnt, struct dentry *dentry,
 				goto out;
 			lock_mount_hash();
 			list_add_tail(&q->mnt_list, &res->mnt_list);
-			mnt_set_mountpoint(parent, p->mnt_mp, q);
-			if (!list_empty(&parent->mnt_mounts)) {
-				t = list_last_entry(&parent->mnt_mounts,
-					struct mount, mnt_child);
-				if (t->mnt_mp != p->mnt_mp)
-					t = NULL;
-			}
-			attach_shadowed(q, parent, t);
+			attach_mnt(q, parent, p->mnt_mp);
 			unlock_mount_hash();
 		}
 	}
@@ -1992,10 +2001,18 @@ static int attach_recursive_mnt(struct mount *source_mnt,
 {
 	HLIST_HEAD(tree_list);
 	struct mnt_namespace *ns = dest_mnt->mnt_ns;
+	struct mountpoint *smp;
 	struct mount *child, *p;
 	struct hlist_node *n;
 	int err;
 
+	/* Preallocate a mountpoint in case the new mounts need
+	 * to be tucked under other mounts.
+	 */
+	smp = get_mountpoint(source_mnt->mnt.mnt_root);
+	if (IS_ERR(smp))
+		return PTR_ERR(smp);
+
 	/* Is there space to add these mounts to the mount namespace? */
 	if (!parent_path) {
 		err = count_mounts(ns, source_mnt);
@@ -2022,16 +2039,19 @@ static int attach_recursive_mnt(struct mount *source_mnt,
 		touch_mnt_namespace(source_mnt->mnt_ns);
 	} else {
 		mnt_set_mountpoint(dest_mnt, dest_mp, source_mnt);
-		commit_tree(source_mnt, NULL);
+		commit_tree(source_mnt);
 	}
 
 	hlist_for_each_entry_safe(child, n, &tree_list, mnt_hash) {
 		struct mount *q;
 		hlist_del_init(&child->mnt_hash);
-		q = __lookup_mnt_last(&child->mnt_parent->mnt,
-				      child->mnt_mountpoint);
-		commit_tree(child, q);
+		q = __lookup_mnt(&child->mnt_parent->mnt,
+				 child->mnt_mountpoint);
+		if (q)
+			mnt_change_mountpoint(child, smp, q);
+		commit_tree(child);
 	}
+	put_mountpoint(smp);
 	unlock_mount_hash();
 
 	return 0;
@@ -2046,6 +2066,11 @@ static int attach_recursive_mnt(struct mount *source_mnt,
 	cleanup_group_ids(source_mnt, NULL);
 out:
 	ns->pending_mounts = 0;
+
+	read_seqlock_excl(&mount_lock);
+	put_mountpoint(smp);
+	read_sequnlock_excl(&mount_lock);
+
 	return err;
 }
 
@@ -2794,7 +2819,7 @@ long do_mount(const char *dev_name, const char __user *dir_name,
 
 	flags &= ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE | MS_BORN |
 		   MS_NOATIME | MS_NODIRATIME | MS_RELATIME| MS_KERNMOUNT |
-		   MS_STRICTATIME | MS_NOREMOTELOCK);
+		   MS_STRICTATIME | MS_NOREMOTELOCK | MS_SUBMOUNT);
 
 	if (flags & MS_REMOUNT)
 		retval = do_remount(&path, flags & ~MS_REMOUNT, mnt_flags,
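The `attach_recursive_mnt()` hunk above replaces shadow mounts with "tucking": when a propagated mount lands on a (parent, mountpoint) pair that is already occupied, the existing mount is moved onto the root of the new mount via `mnt_change_mountpoint()`, and the new mount takes its place. The sketch below is a simplified userspace model of that idea, not the kernel algorithm. Dentries are plain integers, the mount hash is a small array, and `struct mount_table`, `lookup_mnt()` and `attach_tucked()` are hypothetical names.

```c
#include <stddef.h>

#define MAX_MOUNTS 8

/* Hypothetical miniature mount: dentries are modeled as integer ids. */
struct mount {
	struct mount *parent;
	int mountpoint;  /* dentry id inside the parent where this sits */
	int root;        /* dentry id of this mount's own root */
};

/* Stand-in for the mount hash table. */
struct mount_table {
	struct mount *slots[MAX_MOUNTS];
	size_t nr;
};

/* Find the mount attached at (parent, mountpoint), if any. */
static struct mount *lookup_mnt(const struct mount_table *t,
				const struct mount *parent, int mountpoint)
{
	for (size_t i = 0; i < t->nr; i++) {
		struct mount *m = t->slots[i];

		if (m->parent == parent && m->mountpoint == mountpoint)
			return m;
	}
	return NULL;
}

/*
 * Attach child at (parent, mountpoint). If a mount q is already there,
 * re-home q onto child's root first, so child ends up tucked beneath q
 * and each (parent, mountpoint) pair holds at most one mount.
 */
static int attach_tucked(struct mount_table *t, struct mount *child,
			 struct mount *parent, int mountpoint)
{
	struct mount *q;

	if (t->nr >= MAX_MOUNTS)
		return -1;

	q = lookup_mnt(t, parent, mountpoint);
	if (q) {
		q->parent = child;
		q->mountpoint = child->root;
	}
	child->parent = parent;
	child->mountpoint = mountpoint;
	t->slots[t->nr++] = child;
	return 0;
}
```

Because a lookup on any (parent, mountpoint) pair now yields at most one mount, the model has no chain of same-key entries to walk, which mirrors the merge message's point about preventing unbounded chains in the mount hash table.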

fs/nfs/namespace.c

Lines changed: 1 addition & 1 deletion
@@ -226,7 +226,7 @@ static struct vfsmount *nfs_do_clone_mount(struct nfs_server *server,
 					   const char *devname,
 					   struct nfs_clone_mount *mountdata)
 {
-	return vfs_kern_mount(&nfs_xdev_fs_type, 0, devname, mountdata);
+	return vfs_submount(mountdata->dentry, &nfs_xdev_fs_type, devname, mountdata);
 }
 
 /**

fs/nfs/nfs4namespace.c

Lines changed: 1 addition & 1 deletion
@@ -279,7 +279,7 @@ static struct vfsmount *try_location(struct nfs_clone_mount *mountdata,
 				mountdata->hostname,
 				mountdata->mnt_path);
 
-		mnt = vfs_kern_mount(&nfs4_referral_fs_type, 0, page, mountdata);
+		mnt = vfs_submount(mountdata->dentry, &nfs4_referral_fs_type, page, mountdata);
 		if (!IS_ERR(mnt))
 			break;
 	}

fs/notify/inotify/inotify.h

Lines changed: 17 additions & 0 deletions
@@ -30,3 +30,20 @@ extern int inotify_handle_event(struct fsnotify_group *group,
 				const unsigned char *file_name, u32 cookie);
 
 extern const struct fsnotify_ops inotify_fsnotify_ops;
+
+#ifdef CONFIG_INOTIFY_USER
+static inline void dec_inotify_instances(struct ucounts *ucounts)
+{
+	dec_ucount(ucounts, UCOUNT_INOTIFY_INSTANCES);
+}
+
+static inline struct ucounts *inc_inotify_watches(struct ucounts *ucounts)
+{
+	return inc_ucount(ucounts->ns, ucounts->uid, UCOUNT_INOTIFY_WATCHES);
+}
+
+static inline void dec_inotify_watches(struct ucounts *ucounts)
+{
+	dec_ucount(ucounts, UCOUNT_INOTIFY_WATCHES);
+}
+#endif
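The helpers above route inotify accounting through `inc_ucount()` and `dec_ucount()`, so that limits apply per user namespace. One way to picture hierarchical accounting of this kind is a chain of counters, where a charge must fit under the limit at every level from the current namespace up to the initial one and is rolled back if any level refuses. The sketch below is an assumption-laden model of that pattern. `struct ns_counter`, `inc_count()` and `dec_count()` are hypothetical, the per-user dimension of the real counters is left out, and plain integers replace the atomic operations a kernel implementation would need.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-namespace counter with a link to its parent. */
struct ns_counter {
	struct ns_counter *parent;  /* NULL for the initial namespace */
	int count;
	int max;
};

/*
 * Charge one object against c and every ancestor. If any level is
 * already at its limit, undo the charges made so far and fail.
 */
static bool inc_count(struct ns_counter *c)
{
	struct ns_counter *iter, *undo;

	for (iter = c; iter; iter = iter->parent) {
		if (iter->count >= iter->max)
			goto fail;
		iter->count++;
	}
	return true;

fail:
	for (undo = c; undo != iter; undo = undo->parent)
		undo->count--;
	return false;
}

/* Release one object from c and every ancestor. */
static void dec_count(struct ns_counter *c)
{
	for (; c; c = c->parent)
		c->count--;
}
```

In this model a container namespace with a generous limit is still bounded by its parent: once the parent is full, `inc_count()` on the child fails and leaves every counter unchanged.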
