
Commit 1a1fb98

futex: Handle early deadlock return correctly
commit 56222b2 ("futex: Drop hb->lock before enqueueing on the rtmutex")
changed the locking rules in the futex code so that the hash bucket lock
is no longer held while the waiter is enqueued into the rtmutex wait
list.

This made the lock and the unlock path symmetric, but unfortunately the
possible early exit from __rt_mutex_proxy_start() due to a detected
deadlock was not updated accordingly. That allows a concurrent unlocker
to observe inconsistent state which triggers the warning in the unlock
path.

futex_lock_pi()                         futex_unlock_pi()
  lock(hb->lock)
  queue(hb_waiter)                        lock(hb->lock)
  lock(rtmutex->wait_lock)
  unlock(hb->lock)
                                          // acquired hb->lock
                                          hb_waiter = futex_top_waiter()
                                          lock(rtmutex->wait_lock)
  __rt_mutex_proxy_start()
     ---> fail
          remove(rtmutex_waiter);
     ---> returns -EDEADLOCK
  unlock(rtmutex->wait_lock)
                                          // acquired wait_lock
                                          wake_futex_pi()
                                          rt_mutex_next_owner()
                                            --> returns NULL
                                            --> WARN

  lock(hb->lock)
  unqueue(hb_waiter)

The problem is caused by the remove(rtmutex_waiter) in the failure case
of __rt_mutex_proxy_start() as this lets the unlocker observe a waiter
in the hash bucket but no waiter on the rtmutex, i.e. inconsistent
state.

The original commit handles this correctly for the other early return
cases (timeout, signal) by delaying the removal of the rtmutex waiter
until the returning task has reacquired the hash bucket lock.

Treat the failure case of __rt_mutex_proxy_start() in the same way and
let the existing cleanup code handle the eventual handover of the
rtmutex gracefully.

The regular rt_mutex_proxy_start() gains the rtmutex waiter removal for
the failure case, so that the other call sites still operate correctly.

Add proper comments to the code so all these details are fully
documented.

Thanks to Peter for helping with the analysis and writing the really
valuable code comments.

Fixes: 56222b2 ("futex: Drop hb->lock before enqueueing on the rtmutex")
Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Co-developed-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: linux-s390@vger.kernel.org
Cc: Stefan Liebler <stli@linux.ibm.com>
Cc: Sebastian Sewior <bigeasy@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1901292311410.1950@nanos.tec.linutronix.de
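The invariant the patch restores generalizes beyond futexes: an observer that
takes hb->lock and then rtmutex->wait_lock must never find a waiter queued on
the hash bucket but missing from the rtmutex. The following self-contained
pthread program is a userspace analogy of that rule -- every name in it
(hb_lock, wait_lock, locker_buggy(), ...) is a hypothetical stand-in, not
kernel code -- contrasting the early removal that caused the WARN with the
delayed removal the patch adopts.

/* Userspace analogy of the hb->lock / rtmutex->wait_lock consistency rule.
 * Hypothetical names throughout; build with: cc -pthread -o demo demo.c */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t hb_lock   = PTHREAD_MUTEX_INITIALIZER; /* models hb->lock */
static pthread_mutex_t wait_lock = PTHREAD_MUTEX_INITIALIZER; /* models rtmutex->wait_lock */
static bool on_hb_list;      /* waiter visible on the hash bucket */
static bool on_rtmutex_list; /* waiter visible on the rtmutex */

/* Early removal while only wait_lock is held: an observer interleaving
 * like futex_unlock_pi() can see on_hb_list && !on_rtmutex_list. */
static void locker_buggy(void)
{
        pthread_mutex_lock(&hb_lock);
        on_hb_list = true;
        pthread_mutex_lock(&wait_lock);
        pthread_mutex_unlock(&hb_lock);

        on_rtmutex_list = true;       /* enqueue on the "rtmutex" */
        on_rtmutex_list = false;      /* deadlock detected: removed too early */
        pthread_mutex_unlock(&wait_lock);

        pthread_mutex_lock(&hb_lock); /* observer may have run in between */
        on_hb_list = false;
        pthread_mutex_unlock(&hb_lock);
}

/* Delayed removal: leave the waiter enqueued on failure and only undo
 * both enqueues after reacquiring hb_lock, as the patch does for the
 * -EDEADLK case via the cleanup path. */
static void locker_fixed(void)
{
        pthread_mutex_lock(&hb_lock);
        on_hb_list = true;
        pthread_mutex_lock(&wait_lock);
        pthread_mutex_unlock(&hb_lock);

        on_rtmutex_list = true;       /* enqueue; deadlock detected, keep it */
        pthread_mutex_unlock(&wait_lock);

        pthread_mutex_lock(&hb_lock); /* reacquire the hash bucket lock first */
        pthread_mutex_lock(&wait_lock);
        on_rtmutex_list = false;      /* now remove both entries together */
        on_hb_list = false;
        pthread_mutex_unlock(&wait_lock);
        pthread_mutex_unlock(&hb_lock);
}

/* Observer modeled on futex_unlock_pi(): hb_lock first, then wait_lock. */
static void *unlocker(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&hb_lock);
        pthread_mutex_lock(&wait_lock);
        if (on_hb_list && !on_rtmutex_list)
                printf("inconsistent state observed (the WARN case)\n");
        pthread_mutex_unlock(&wait_lock);
        pthread_mutex_unlock(&hb_lock);
        return NULL;
}

int main(int argc, char **argv)
{
        pthread_t t;

        (void)argv;
        pthread_create(&t, NULL, unlocker, NULL);
        if (argc > 1)
                locker_buggy();  /* run with any argument to race for the WARN */
        else
                locker_fixed();
        pthread_join(t, NULL);
        return 0;
}

Whether the buggy variant actually trips the check is timing dependent; the
point is structural: the fixed variant never exposes a window in which the two
lists disagree to a hb_lock-then-wait_lock observer.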
1 parent: 6f568eb · commit: 1a1fb98

2 files changed: +50 / -15 lines

kernel/futex.c

Lines changed: 18 additions & 10 deletions
@@ -2861,35 +2861,39 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags,
          * and BUG when futex_unlock_pi() interleaves with this.
          *
          * Therefore acquire wait_lock while holding hb->lock, but drop the
-         * latter before calling rt_mutex_start_proxy_lock(). This still fully
-         * serializes against futex_unlock_pi() as that does the exact same
-         * lock handoff sequence.
+         * latter before calling __rt_mutex_start_proxy_lock(). This
+         * interleaves with futex_unlock_pi() -- which does a similar lock
+         * handoff -- such that the latter can observe the futex_q::pi_state
+         * before __rt_mutex_start_proxy_lock() is done.
          */
         raw_spin_lock_irq(&q.pi_state->pi_mutex.wait_lock);
         spin_unlock(q.lock_ptr);
+        /*
+         * __rt_mutex_start_proxy_lock() unconditionally enqueues the @rt_waiter
+         * such that futex_unlock_pi() is guaranteed to observe the waiter when
+         * it sees the futex_q::pi_state.
+         */
         ret = __rt_mutex_start_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter, current);
         raw_spin_unlock_irq(&q.pi_state->pi_mutex.wait_lock);
 
         if (ret) {
                 if (ret == 1)
                         ret = 0;
-
-                spin_lock(q.lock_ptr);
-                goto no_block;
+                goto cleanup;
         }
 
-
         if (unlikely(to))
                 hrtimer_start_expires(&to->timer, HRTIMER_MODE_ABS);
 
         ret = rt_mutex_wait_proxy_lock(&q.pi_state->pi_mutex, to, &rt_waiter);
 
+cleanup:
         spin_lock(q.lock_ptr);
         /*
-         * If we failed to acquire the lock (signal/timeout), we must
+         * If we failed to acquire the lock (deadlock/signal/timeout), we must
          * first acquire the hb->lock before removing the lock from the
-         * rt_mutex waitqueue, such that we can keep the hb and rt_mutex
-         * wait lists consistent.
+         * rt_mutex waitqueue, such that we can keep the hb and rt_mutex wait
+         * lists consistent.
          *
          * In particular; it is important that futex_unlock_pi() can not
          * observe this inconsistency.
@@ -3013,6 +3017,10 @@ static int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
          * there is no point where we hold neither; and therefore
          * wake_futex_pi() must observe a state consistent with what we
          * observed.
+         *
+         * In particular; this forces __rt_mutex_start_proxy() to
+         * complete such that we're guaranteed to observe the
+         * rt_waiter. Also see the WARN in wake_futex_pi().
          */
         raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
         spin_unlock(&hb->lock);
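The comment added in the second hunk leans on the lock-handoff idiom:
futex_unlock_pi() acquires pi_mutex.wait_lock while still holding hb->lock and
only then drops hb->lock, so there is no instant at which it holds neither. A
minimal sketch of that idiom, with hypothetical userspace names rather than
the kernel's primitives:

/* Lock handoff: take lock B while still holding lock A, then drop A.
 * The thread never holds neither lock, so any update made entirely
 * under A-then-B ordering is observed consistently across the switch.
 * Hypothetical names; build with: cc -pthread */
#include <pthread.h>

static pthread_mutex_t hb_lock   = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t wait_lock = PTHREAD_MUTEX_INITIALIZER;

static void unlock_side(void)
{
        pthread_mutex_lock(&hb_lock);
        /* observe hash bucket state (futex_top_waiter() in the kernel) */
        pthread_mutex_lock(&wait_lock);   /* handoff: acquire B ... */
        pthread_mutex_unlock(&hb_lock);   /* ... before releasing A */
        /* observe rtmutex state: a waiter enqueued while we held hb_lock
         * is guaranteed visible here (see the WARN in wake_futex_pi()) */
        pthread_mutex_unlock(&wait_lock);
}

int main(void)
{
        unlock_side();
        return 0;
}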

kernel/locking/rtmutex.c

Lines changed: 32 additions & 5 deletions
@@ -1726,12 +1726,33 @@ void rt_mutex_proxy_unlock(struct rt_mutex *lock,
         rt_mutex_set_owner(lock, NULL);
 }
 
+/**
+ * __rt_mutex_start_proxy_lock() - Start lock acquisition for another task
+ * @lock: the rt_mutex to take
+ * @waiter: the pre-initialized rt_mutex_waiter
+ * @task: the task to prepare
+ *
+ * Starts the rt_mutex acquire; it enqueues the @waiter and does deadlock
+ * detection. It does not wait, see rt_mutex_wait_proxy_lock() for that.
+ *
+ * NOTE: does _NOT_ remove the @waiter on failure; must either call
+ * rt_mutex_wait_proxy_lock() or rt_mutex_cleanup_proxy_lock() after this.
+ *
+ * Returns:
+ *  0 - task blocked on lock
+ *  1 - acquired the lock for task, caller should wake it up
+ * <0 - error
+ *
+ * Special API call for PI-futex support.
+ */
 int __rt_mutex_start_proxy_lock(struct rt_mutex *lock,
                                 struct rt_mutex_waiter *waiter,
                                 struct task_struct *task)
 {
         int ret;
 
+        lockdep_assert_held(&lock->wait_lock);
+
         if (try_to_take_rt_mutex(lock, task, NULL))
                 return 1;
 
@@ -1749,9 +1770,6 @@ int __rt_mutex_start_proxy_lock(struct rt_mutex *lock,
                 ret = 0;
         }
 
-        if (unlikely(ret))
-                remove_waiter(lock, waiter);
-
         debug_rt_mutex_print_deadlock(waiter);
 
         return ret;
@@ -1763,12 +1781,18 @@ int __rt_mutex_start_proxy_lock(struct rt_mutex *lock,
  * @waiter: the pre-initialized rt_mutex_waiter
  * @task: the task to prepare
  *
+ * Starts the rt_mutex acquire; it enqueues the @waiter and does deadlock
+ * detection. It does not wait, see rt_mutex_wait_proxy_lock() for that.
+ *
+ * NOTE: unlike __rt_mutex_start_proxy_lock this _DOES_ remove the @waiter
+ * on failure.
+ *
  * Returns:
  *  0 - task blocked on lock
  *  1 - acquired the lock for task, caller should wake it up
  * <0 - error
  *
- * Special API call for FUTEX_REQUEUE_PI support.
+ * Special API call for PI-futex support.
  */
 int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
                               struct rt_mutex_waiter *waiter,
@@ -1778,6 +1802,8 @@ int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
 
         raw_spin_lock_irq(&lock->wait_lock);
         ret = __rt_mutex_start_proxy_lock(lock, waiter, task);
+        if (unlikely(ret))
+                remove_waiter(lock, waiter);
         raw_spin_unlock_irq(&lock->wait_lock);
 
         return ret;
@@ -1845,7 +1871,8 @@ int rt_mutex_wait_proxy_lock(struct rt_mutex *lock,
  * @lock: the rt_mutex we were woken on
  * @waiter: the pre-initialized rt_mutex_waiter
  *
- * Attempt to clean up after a failed rt_mutex_wait_proxy_lock().
+ * Attempt to clean up after a failed __rt_mutex_start_proxy_lock() or
+ * rt_mutex_wait_proxy_lock().
  *
  * Unless we acquired the lock; we're still enqueued on the wait-list and can
  * in fact still be granted ownership until we're removed. Therefore we can
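The rtmutex.c side of the patch also pins down a calling-convention split that
the new kerneldoc spells out: the double-underscore variant assumes wait_lock
is held and deliberately leaves the waiter enqueued on failure, while the
plain rt_mutex_start_proxy_lock() stays self-contained for its other call
sites by removing the waiter itself. A compact userspace sketch of that
wrapper pattern -- names are hypothetical, not the kernel API:

/* Wrapper pattern: a _locked inner helper that does not clean up on
 * failure, plus a self-contained wrapper that does. Hypothetical names;
 * build with: cc -pthread */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t wait_lock = PTHREAD_MUTEX_INITIALIZER;
static bool waiter_enqueued;

/* Caller must hold wait_lock. Unconditionally enqueues and does NOT
 * dequeue on failure; the caller's cleanup path handles that later,
 * mirroring __rt_mutex_start_proxy_lock(). */
static int start_proxy_locked(bool deadlock)
{
        waiter_enqueued = true;
        return deadlock ? -1 : 0;
}

/* Self-contained wrapper: takes the lock and undoes the enqueue on
 * failure, mirroring what rt_mutex_start_proxy_lock() now does. */
static int start_proxy(bool deadlock)
{
        int ret;

        pthread_mutex_lock(&wait_lock);
        ret = start_proxy_locked(deadlock);
        if (ret)
                waiter_enqueued = false; /* the remove_waiter() equivalent */
        pthread_mutex_unlock(&wait_lock);
        return ret;
}

int main(void)
{
        int ret;

        ret = start_proxy(false);
        printf("success: ret=%d enqueued=%d\n", ret, waiter_enqueued);
        ret = start_proxy(true);
        printf("failure: ret=%d enqueued=%d\n", ret, waiter_enqueued);
        return 0;
}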
