Commit b0c29f7

Davidlohr Bueso authored and Ingo Molnar committed
futexes: Avoid taking the hb->lock if there's nothing to wake up
In futex_wake() there is clearly no point in taking the hb->lock if we know
beforehand that there are no tasks to be woken. While the hash bucket's plist
head is a cheap way of knowing this, we cannot rely 100% on it as there is a
racy window between the futex_wait call and when the task is actually added
to the plist. To this end, we couple it with the spinlock check as tasks
trying to enter the critical region are most likely potential waiters that
will be added to the plist, thus preventing tasks sleeping forever if wakers
don't acknowledge all possible waiters.

Furthermore, the futex ordering guarantees are preserved, ensuring that
waiters either observe the changed user space value before blocking or are
woken by a concurrent waker. For wakers, this is done by relying on the
barriers in get_futex_key_refs() -- for archs that do not have implicit mb
in atomic_inc(), we explicitly add them through a new futex_get_mm function.
For waiters we rely on the fact that spin_lock calls already update the head
counter, so spinners are visible even if the lock hasn't been acquired yet.

For more details please refer to the updated comments in the code and related
discussion:

  https://lkml.org/lkml/2013/11/26/556

Special thanks to tglx for careful review and feedback.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Darren Hart <dvhart@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Scott Norton <scott.norton@hp.com>
Cc: Tom Vaden <tom.vaden@hp.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: Jason Low <jason.low2@hp.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1389569486-25487-5-git-send-email-davidlohr@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
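
The ordering guarantee described in the message (a waiter either sees the new futex value or the waker sees the waiter) has the shape of a classic store-buffering litmus test. The following is a minimal user-space sketch of that shape, not kernel code: it assumes C11 atomics and POSIX threads, and the names `X`, `Y`, `waiter`, `waker` and `count_forbidden` are hypothetical stand-ins for the waiters count, the futex word, and the two sides of the race. The sequentially consistent fences play the role of the full barriers (A) and (B).

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical stand-ins: X models the waiters count, Y the futex word. */
static atomic_int X, Y;
static int rx, ry;	/* value of X seen by the waker, value of Y seen by the waiter */

static void *waiter(void *arg)
{
	(void)arg;
	atomic_store_explicit(&X, 1, memory_order_relaxed);	/* waiters++ */
	atomic_thread_fence(memory_order_seq_cst);		/* full barrier (A) */
	ry = atomic_load_explicit(&Y, memory_order_relaxed);	/* uval = *futex */
	return NULL;
}

static void *waker(void *arg)
{
	(void)arg;
	atomic_store_explicit(&Y, 1, memory_order_relaxed);	/* *futex = newval */
	atomic_thread_fence(memory_order_seq_cst);		/* full barrier (B) */
	rx = atomic_load_explicit(&X, memory_order_relaxed);	/* if (waiters) */
	return NULL;
}

/*
 * Runs the race `iterations` times and returns how often both sides read 0,
 * i.e. how often the waiter missed the new value AND the waker missed the
 * waiter. Returns -1 if a thread could not be created.
 */
int count_forbidden(int iterations)
{
	int forbidden = 0;

	assert(iterations >= 0);
	for (int i = 0; i < iterations; i++) {
		pthread_t a, b;

		atomic_store(&X, 0);
		atomic_store(&Y, 0);
		if (pthread_create(&a, NULL, waiter, NULL))
			return -1;
		if (pthread_create(&b, NULL, waker, NULL)) {
			pthread_join(a, NULL);
			return -1;
		}
		pthread_join(a, NULL);
		pthread_join(b, NULL);
		if (rx == 0 && ry == 0)
			forbidden++;
	}
	return forbidden;
}
```

With both fences in place the C11 memory model rules out the outcome where both loads return 0, so `count_forbidden()` should return 0 for any iteration count. Removing either fence makes that outcome permitted, which corresponds to the lost wakeup the commit guards against.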
1 parent 99b60ce commit b0c29f7

kernel/futex.c

Lines changed: 92 additions & 25 deletions
@@ -75,17 +75,20 @@
  * The waiter reads the futex value in user space and calls
  * futex_wait(). This function computes the hash bucket and acquires
  * the hash bucket lock. After that it reads the futex user space value
- * again and verifies that the data has not changed. If it has not
- * changed it enqueues itself into the hash bucket, releases the hash
- * bucket lock and schedules.
+ * again and verifies that the data has not changed. If it has not changed
+ * it enqueues itself into the hash bucket, releases the hash bucket lock
+ * and schedules.
  *
  * The waker side modifies the user space value of the futex and calls
- * futex_wake(). This functions computes the hash bucket and acquires
- * the hash bucket lock. Then it looks for waiters on that futex in the
- * hash bucket and wakes them.
+ * futex_wake(). This function computes the hash bucket and acquires the
+ * hash bucket lock. Then it looks for waiters on that futex in the hash
+ * bucket and wakes them.
  *
- * Note that the spin_lock serializes waiters and wakers, so that the
- * following scenario is avoided:
+ * In futex wake up scenarios where no tasks are blocked on a futex, taking
+ * the hb spinlock can be avoided and simply return. In order for this
+ * optimization to work, ordering guarantees must exist so that the waiter
+ * being added to the list is acknowledged when the list is concurrently being
+ * checked by the waker, avoiding scenarios like the following:
  *
  * CPU 0                             CPU 1
  * val = *futex;
@@ -106,24 +109,52 @@
  * This would cause the waiter on CPU 0 to wait forever because it
  * missed the transition of the user space value from val to newval
  * and the waker did not find the waiter in the hash bucket queue.
- * The spinlock serializes that:
  *
- * CPU 0                             CPU 1
+ * The correct serialization ensures that a waiter either observes
+ * the changed user space value before blocking or is woken by a
+ * concurrent waker:
+ *
+ * CPU 0                                 CPU 1
  * val = *futex;
  * sys_futex(WAIT, futex, val);
  *   futex_wait(futex, val);
- *   lock(hash_bucket(futex));
- *   uval = *futex;
- *                                   *futex = newval;
- *                                   sys_futex(WAKE, futex);
- *                                     futex_wake(futex);
- *                                     lock(hash_bucket(futex));
+ *
+ *   waiters++;
+ *   mb(); (A) <-- paired with -.
+ *                              |
+ *   lock(hash_bucket(futex));  |
+ *                              |
+ *   uval = *futex;             |
+ *                              |        *futex = newval;
+ *                              |        sys_futex(WAKE, futex);
+ *                              |          futex_wake(futex);
+ *                              |
+ *                              `------->  mb(); (B)
  *   if (uval == val)
- *      queue();
+ *     queue();
  *     unlock(hash_bucket(futex));
- *     schedule();                     if (!queue_empty())
- *                                       wake_waiters(futex);
- *                                       unlock(hash_bucket(futex));
+ *     schedule();                         if (waiters)
+ *                                           lock(hash_bucket(futex));
+ *                                           wake_waiters(futex);
+ *                                           unlock(hash_bucket(futex));
+ *
+ * Where (A) orders the waiters increment and the futex value read -- this
+ * is guaranteed by the head counter in the hb spinlock; and where (B)
+ * orders the write to futex and the waiters read -- this is done by the
+ * barriers in get_futex_key_refs(), through either ihold or atomic_inc,
+ * depending on the futex type.
+ *
+ * This yields the following case (where X:=waiters, Y:=futex):
+ *
+ *      X = Y = 0
+ *
+ *      w[X]=1          w[Y]=1
+ *      MB              MB
+ *      r[Y]=y          r[X]=x
+ *
+ * Which guarantees that x==0 && y==0 is impossible; which translates back into
+ * the guarantee that we cannot both miss the futex variable change and the
+ * enqueue.
  */
 
 int __read_mostly futex_cmpxchg_enabled;
@@ -211,6 +242,36 @@ static unsigned long __read_mostly futex_hashsize;
 
 static struct futex_hash_bucket *futex_queues;
 
+static inline void futex_get_mm(union futex_key *key)
+{
+	atomic_inc(&key->private.mm->mm_count);
+	/*
+	 * Ensure futex_get_mm() implies a full barrier such that
+	 * get_futex_key() implies a full barrier. This is relied upon
+	 * as full barrier (B), see the ordering comment above.
+	 */
+	smp_mb__after_atomic_inc();
+}
+
+static inline bool hb_waiters_pending(struct futex_hash_bucket *hb)
+{
+#ifdef CONFIG_SMP
+	/*
+	 * Tasks trying to enter the critical region are most likely
+	 * potential waiters that will be added to the plist. Ensure
+	 * that wakers won't miss to-be-slept tasks in the window between
+	 * the wait call and the actual plist_add.
+	 */
+	if (spin_is_locked(&hb->lock))
+		return true;
+	smp_rmb(); /* Make sure we check the lock state first */
+
+	return !plist_head_empty(&hb->chain);
+#else
+	return true;
+#endif
+}
+
 /*
  * We hash on the keys returned from get_futex_key (see below).
  */
@@ -245,10 +306,10 @@ static void get_futex_key_refs(union futex_key *key)
 
 	switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
 	case FUT_OFF_INODE:
-		ihold(key->shared.inode);
+		ihold(key->shared.inode); /* implies MB (B) */
 		break;
 	case FUT_OFF_MMSHARED:
-		atomic_inc(&key->private.mm->mm_count);
+		futex_get_mm(key); /* implies MB (B) */
 		break;
 	}
 }
@@ -322,7 +383,7 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, int rw)
 	if (!fshared) {
 		key->private.mm = mm;
 		key->private.address = address;
-		get_futex_key_refs(key);
+		get_futex_key_refs(key); /* implies MB (B) */
 		return 0;
 	}

@@ -429,7 +490,7 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, int rw)
 		key->shared.pgoff = basepage_index(page);
 	}
 
-	get_futex_key_refs(key);
+	get_futex_key_refs(key); /* implies MB (B) */
 
 out:
 	unlock_page(page_head);
@@ -1052,6 +1113,11 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
 		goto out;
 
 	hb = hash_futex(&key);
+
+	/* Make sure we really have tasks to wakeup */
+	if (!hb_waiters_pending(hb))
+		goto out_put_key;
+
 	spin_lock(&hb->lock);
 
 	plist_for_each_entry_safe(this, next, &hb->chain, list) {
@@ -1072,6 +1138,7 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
 	}
 
 	spin_unlock(&hb->lock);
+out_put_key:
 	put_futex_key(&key);
 out:
 	return ret;
@@ -1535,7 +1602,7 @@ static inline struct futex_hash_bucket *queue_lock(struct futex_q *q)
 	hb = hash_futex(&q->key);
 	q->lock_ptr = &hb->lock;
 
-	spin_lock(&hb->lock);
+	spin_lock(&hb->lock); /* implies MB (A) */
 	return hb;
 }
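
To make the new control flow in futex_wake() easier to follow, here is a small single-threaded model of the fast path. It is only a sketch under stated assumptions: `struct mock_bucket`, `mock_waiters_pending()` and `mock_futex_wake()` are hypothetical names, the boolean `lock_held` stands in for `spin_is_locked(&hb->lock)`, the integer `queued` stands in for the entries on `hb->chain`, and no real locking or memory barriers are modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical single-threaded model of a futex hash bucket. */
struct mock_bucket {
	bool lock_held;		/* models spin_is_locked(&hb->lock) */
	int queued;		/* models the number of waiters on hb->chain */
	int lock_acquisitions;	/* how many times the wake path took the lock */
};

/* Mirrors the shape of hb_waiters_pending(): lock state first, then the list. */
static bool mock_waiters_pending(const struct mock_bucket *hb)
{
	if (hb->lock_held)
		return true;	/* someone may be about to enqueue itself */
	return hb->queued > 0;
}

/* Mirrors the shape of the patched futex_wake(): skip the lock if nobody waits. */
int mock_futex_wake(struct mock_bucket *hb, int nr_wake)
{
	int woken = 0;

	assert(hb != NULL && nr_wake >= 0);

	if (!mock_waiters_pending(hb))
		return 0;		/* fast path: lock never touched */

	hb->lock_acquisitions++;	/* slow path: stands in for spin_lock() */
	while (hb->queued > 0 && woken < nr_wake) {
		hb->queued--;		/* stands in for waking one waiter */
		woken++;
	}
	return woken;
}
```

In this model an empty, unlocked bucket returns 0 without incrementing `lock_acquisitions`, while a bucket whose lock is held by a not-yet-queued waiter still sends the waker down the slow path. That second case is the one the spin_is_locked() check exists for: the plist alone would look empty during the window between the wait call and plist_add.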
