fix: Close peer negotiate mutex if we haven't negotiated #1774

Merged

kylecarbs merged 2 commits into main from negotiatemutex on May 27, 2022

Conversation

@kylecarbs (Member) commented May 26, 2022

Closes #1706 and #1644.

The sleep was removed a while back in the belief that the issue might be fixed upstream, but that turned out not to be true 😢.

@kylecarbs kylecarbs requested a review from coadler May 26, 2022 03:36
@kylecarbs kylecarbs marked this pull request as ready for review May 26, 2022 03:37
@kylecarbs kylecarbs self-assigned this May 26, 2022
@kylecarbs kylecarbs requested a review from johnstcn May 26, 2022 03:48
@johnstcn (Member) left a comment

There seems to be a condition we can hit fairly often that causes us to try to unlock the negotiateMutex twice.

Maybe instead of an atomic plus a mutex, we should guard a regular bool with an RWMutex, roughly like the sketch below? I tried just throwing a recover() in another defer, but that didn't seem to help here.
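
A minimal sketch of that suggestion, with hypothetical names standing in for the real peer.Conn fields: a plain bool guarded by a single sync.RWMutex replaces the atomic-flag-plus-mutex pairing, so the flag and the lock can never disagree.

package peerdemo

import "sync"

type conn struct {
	mu            sync.RWMutex
	hasNegotiated bool
}

// markNegotiated records that negotiation has completed.
func (c *conn) markNegotiated() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.hasNegotiated = true
}

// negotiated reports whether negotiation has completed; RLock lets
// concurrent readers check the flag without blocking each other.
func (c *conn) negotiated() bool {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.hasNegotiated
}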

peer/conn.go Outdated
c.negotiateMutex.Lock()
}
c.hasNegotiated = true
defer c.negotiateMutex.Unlock()
@johnstcn (Member) commented May 26, 2022

Unfortunately I can fairly reliably reproduce this issue:

fatal error: sync: unlock of unlocked mutex

goroutine 34 [running]:
runtime.throw({0x100648300?, 0x10050a930?})
	/opt/homebrew/Cellar/go/1.18.2/libexec/src/runtime/panic.go:992 +0x50 fp=0x140000a7640 sp=0x140000a7610 pc=0x100141ec0
sync.throw({0x100648300?, 0x1009447ae?})
	/opt/homebrew/Cellar/go/1.18.2/libexec/src/runtime/panic.go:978 +0x24 fp=0x140000a7660 sp=0x140000a7640 pc=0x10016fd44
sync.(*Mutex).unlockSlow(0x140000e60a0, 0xffffffff)
	/opt/homebrew/Cellar/go/1.18.2/libexec/src/sync/mutex.go:220 +0x3c fp=0x140000a7690 sp=0x140000a7660 pc=0x10017e9dc
sync.(*Mutex).Unlock(0x0?)
	/opt/homebrew/Cellar/go/1.18.2/libexec/src/sync/mutex.go:214 +0x64 fp=0x140000a76b0 sp=0x140000a7690 pc=0x10017e974
github.com/coder/coder/peer.(*Conn).negotiate.func1()
	/Users/cian/src/cdr/coder/peer/conn.go:296 +0x2c fp=0x140000a76d0 sp=0x140000a76b0 pc=0x10050b7ac
github.com/coder/coder/peer.(*Conn).negotiate(0x140000e6000)
	/Users/cian/src/cdr/coder/peer/conn.go:302 +0x2f0 fp=0x140000a7ec0 sp=0x140000a76d0 pc=0x10050a950
github.com/coder/coder/peer.(*Conn).negotiate-fm()
	<autogenerated>:1 +0x2c fp=0x140000a7ee0 sp=0x140000a7ec0 pc=0x10050e3fc
github.com/pion/webrtc/v3.(*PeerConnection).negotiationNeededOp(0x140000bc000)
	/Users/cian/go/pkg/mod/github.com/pion/webrtc/v3@v3.1.41/peerconnection.go:353 +0x174 fp=0x140000a7f50 sp=0x140000a7ee0 pc=0x1004c5c24
github.com/pion/webrtc/v3.(*PeerConnection).negotiationNeededOp-fm()
	<autogenerated>:1 +0x2c fp=0x140000a7f70 sp=0x140000a7f50 pc=0x1004f171c
github.com/pion/webrtc/v3.(*operations).start(0x140000b2018)
	/Users/cian/go/pkg/mod/github.com/pion/webrtc/v3@v3.1.41/operations.go:90 +0x60 fp=0x140000a7fb0 sp=0x140000a7f70 pc=0x1004c4290
github.com/pion/webrtc/v3.(*operations).Enqueue.func1()
	/Users/cian/go/pkg/mod/github.com/pion/webrtc/v3@v3.1.41/operations.go:38 +0x2c fp=0x140000a7fd0 sp=0x140000a7fb0 pc=0x1004c3d1c
runtime.goexit()
	/opt/homebrew/Cellar/go/1.18.2/libexec/src/runtime/asm_arm64.s:1263 +0x4 fp=0x140000a7fd0 sp=0x140000a7fd0 pc=0x1001756e4
created by github.com/pion/webrtc/v3.(*operations).Enqueue
	/Users/cian/go/pkg/mod/github.com/pion/webrtc/v3@v3.1.41/operations.go:38 +0x274

EDIT: see Dean's comment below
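
For reference, this class of failure is unrecoverable by design, which is why the recover() attempt mentioned above had no effect: a second Unlock on a sync.Mutex aborts the runtime rather than raising an ordinary panic. A standalone illustration (not the peer.Conn code):

package main

import "sync"

func main() {
	var mu sync.Mutex
	mu.Lock()
	mu.Unlock()
	mu.Unlock() // fatal error: sync: unlock of unlocked mutex
}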

@kylecarbs (Member, Author) commented

Ahh oops, yes I have this backwards!

@deansheather deansheather self-requested a review May 26, 2022 09:49
peer/conn.go Outdated
Comment on lines 607 to 617
// It's possible the connection can be closed before negotiation has
// begun. In this case, we want to unlock the mutex to unblock any
// pending candidates being flushed.
if !c.hasNegotiated.Load() {
	c.negotiateMutex.Unlock()
}

@deansheather (Member) commented

You shouldn't need this code at all. None of the negotiation or ICE code should block forever on close (or else we have bigger problems), and the unlocks are deferred, so it should be fine.

This is what causes the panics, by the way.
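
A stripped-down sketch of the interleaving being described, with assumed names standing in for the peer.Conn fields (sync/atomic's Bool, Go 1.19+, stands in for whatever atomic flag the real code uses): Close observes the flag before negotiate sets it, so both sides unlock the same mutex and the runtime aborts.

package main

import (
	"sync"
	"sync/atomic"
)

func main() {
	var (
		negotiateMutex sync.Mutex
		hasNegotiated  atomic.Bool
		closeDone      = make(chan struct{})
	)
	negotiateMutex.Lock() // held until negotiation completes

	// Close: sees hasNegotiated == false and unlocks to release waiters.
	go func() {
		defer close(closeDone)
		if !hasNegotiated.Load() {
			negotiateMutex.Unlock() // first unlock
		}
	}()
	<-closeDone

	// negotiate: sets the flag and unlocks, unaware Close already did.
	hasNegotiated.Store(true)
	negotiateMutex.Unlock() // second unlock -> fatal error: sync: unlock of unlocked mutex
}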

@kylecarbs (Member, Author) commented

It can block on AddRemoteCandidate, since that needs to wait for negotiation to complete; see the sketch below.
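
A hedged sketch of that blocking relationship; the field names, errClosed, and the apply callback are illustrative assumptions, not the actual peer.Conn API. A caller parks until negotiation completes, so Close must release it or the caller's goroutine leaks, which is what the goleak failure catches.

package peerdemo

import "errors"

var errClosed = errors.New("connection closed")

type conn struct {
	negotiated chan struct{} // closed once negotiation completes
	closed     chan struct{} // closed by Close
}

// addRemoteCandidate waits for negotiation before applying the candidate,
// but returns early if the connection is closed first.
func (c *conn) addRemoteCandidate(apply func() error) error {
	select {
	case <-c.negotiated:
		return apply()
	case <-c.closed:
		return errClosed
	}
}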

peer/conn.go Outdated
case <-c.negotiated:
default:
	close(c.negotiated)
}
@deansheather (Member) commented

Shouldn't you close(c.closed) sooner?

@kylecarbs (Member, Author) commented

Nah, but you did bring up a good point about the second close not being needed. Fixed!
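
The select-with-default pair above is the standard close-at-most-once idiom. A minimal sketch, assuming the close runs under a mutex (the names are illustrative): without the lock, two goroutines could both fall through to default and double-close the channel; with it, a second close elsewhere is redundant, which is the point conceded above.

package peerdemo

import "sync"

type conn struct {
	mu         sync.Mutex
	negotiated chan struct{}
}

// markNegotiated closes c.negotiated at most once.
func (c *conn) markNegotiated() {
	c.mu.Lock()
	defer c.mu.Unlock()
	select {
	case <-c.negotiated:
		// already closed; nothing to do
	default:
		close(c.negotiated)
	}
}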

@kylecarbs kylecarbs enabled auto-merge (squash) May 27, 2022 17:24
@kylecarbs kylecarbs merged commit 984dc2b into main May 27, 2022
@kylecarbs kylecarbs deleted the negotiatemutex branch May 27, 2022 17:34
Successfully merging this pull request may close these issues:

Bug: Agent test fails with goleak error