Skip to content

feat: implement thin vertical slice of system-generated notifications #13537

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 30 commits into from
Jul 8, 2024

Conversation

dannykopping
Copy link
Contributor

@dannykopping dannykopping commented Jun 11, 2024

NOTE: this is a stacked PR, please review #13536 as well.

This introduces a new subsystem into Coder to enable notifications (see milestone). In this PR, I have implemented a very thin vertical slice of this functionality:

  • immutable system-generated notifications only (later we will expand this to operator-managed notifications)
  • SMTP and webhook delivery only
    • SMTP auth & TLS is not yet enabled; will be added subsequently
    • additional notification methods to come later as well

This functionality is also hidden behind an experiment.


The main entities in this package are:

Notifier is responsible for:

  • periodically dequeuing pending notification_messages entries
  • preparing them for delivery
  • delivering them
  • informing the Manager of successful/failed deliveries

Manager is a controller which is responsible for:

  • facilitating the enqueuing of new messages
  • starting up queue consumers (Notifiers)
  • periodically updating the store with delivery updates
  • managing its own and the Notifiers' lifecycles (run/shutdown)

The queue has been designed to be safe for concurrent use. This can either mean multiple coderd instances running side-by-side and/or each coderd having multiple queue consumers.

Copy link
Contributor Author

dannykopping commented Jun 11, 2024

@dannykopping dannykopping force-pushed the dk/system-notifications-lib branch 2 times, most recently from a1b4e2b to d2a947a Compare June 11, 2024 12:52
@dannykopping dannykopping force-pushed the dk/system-notifications-database branch from 1cb776a to 25636ec Compare June 11, 2024 13:15
@dannykopping dannykopping force-pushed the dk/system-notifications-lib branch 2 times, most recently from dc359de to 2e16d76 Compare June 11, 2024 13:40
@dannykopping dannykopping changed the title feat: system-generated notifications feat: implement thin vertical slice of system-generated notifications Jun 11, 2024
@dannykopping dannykopping marked this pull request as ready for review June 11, 2024 14:13
@dannykopping dannykopping force-pushed the dk/system-notifications-lib branch from d8f51ce to 284b781 Compare June 12, 2024 08:33
@dannykopping dannykopping force-pushed the dk/system-notifications-database branch from 1c4046c to 538bc42 Compare June 12, 2024 09:08
@dannykopping dannykopping force-pushed the dk/system-notifications-lib branch from 284b781 to 9fd8527 Compare June 12, 2024 10:50
@dannykopping dannykopping force-pushed the dk/system-notifications-database branch from 538bc42 to 438bcba Compare June 12, 2024 10:51
@github-actions github-actions bot added the stale This issue is like stale bread. label Jun 20, 2024
@github-actions github-actions bot closed this Jun 23, 2024
@johnstcn johnstcn reopened this Jun 23, 2024
@johnstcn johnstcn removed the stale This issue is like stale bread. label Jun 23, 2024
@dannykopping dannykopping force-pushed the dk/system-notifications-database branch from 438bcba to c1a3010 Compare June 27, 2024 09:46
@dannykopping dannykopping force-pushed the dk/system-notifications-lib branch from 9fd8527 to 8b7fa18 Compare June 27, 2024 09:47
Base automatically changed from dk/system-notifications-database to main June 28, 2024 09:21
Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
Copy link
Member

@johnstcn johnstcn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments below but I don't need to review again. 👍

require.Eventually(t, func() bool { return santa.naughty.Load() == 1 && santa.nice.Load() == 2 }, testutil.WaitMedium, testutil.IntervalFast)

// Stop the manager which forces an update of buffered updates.
require.NoError(t, mgr.Stop(ctx))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is a race condition here, which is that above we wait for notifications to be sent to santa, but sending to santa happens before sending on the success and failure channels. So, triggering stop and the final bulk update might miss one or more success/failure updates and cause this test to flake.

Quartz can help here. The manager's bulk update and the notifier's processing loop depend on Tickers. If you convert them to use a quartz.TickerFunc, then you can control the mock clock and trigger the processing loop and bulk update, and wait for them to finish.

This lets us get rid of the Eventually that checks for dispatched notifications (Quartz allows you to wait until the tick processing is done, so there is no race in just asserting santa got what we expected).

It also means we can get rid of the Eventually below for the same reason --- we can wait for bulk processing to complete before asserting the interceptor's state.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've got a follow-up item to implement Quartz in a subsequent PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 613e074

Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
Copy link
Member

@mtojek mtojek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work! It's a pity we couldn't get this PR in smaller chunks.

I left a few comments, please forgive if some of them were already discussed. I didn't track the whole thread.


// WebhookPayload describes the JSON payload to be delivered to the configured webhook endpoint.
type WebhookPayload struct {
Version string `json:"_version"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: that needs to be published in our Coder docs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, will add as a follow-up.

return ctx.Err()
default:
logger.Warn(ctx, "message dispatch failed", slog.Error(err))
failure <- newFailedDispatch(n.id, msg.ID, err, retryable)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still racy. What you want is

select {
    case <-ctx.Done():
        return ctx.Err()
    case failure <- newFailedDispatch(n.id, msg.ID, err, retryable):
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that default will only be called if there are no cases which can proceed.
https://go.dev/ref/spec#Select_statements

Execution of a "select" statement proceeds in several steps:
...
2. If one or more of the communications can proceed, a single one that can proceed is chosen via a uniform pseudo-random selection. Otherwise, if there is a default case, that case is chosen. ...

Am I misunderstanding something?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's true that default only executes if it is the only case that can succeed, but then writing to failure is a TOCTOU race.

Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
Copy link
Member

@johnstcn johnstcn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I smoke-tested enabling the experiment with no further configuration via docker-compose, and didn't see anything immediately break.

@dannykopping
Copy link
Contributor Author

@spikecurtis since you appear to be on PTO this week, and since I have made a reasonable effort to address all of your comments, I'm going to merge this. I've resolved all of the trivial comments and kept the substantive ones open to prompt further discussion when you return. This is behind an experiment and is reasonably self-contained, so I feel merging this is quite safe. Thanks very much for your thoughtful reviews!

@dannykopping dannykopping requested a review from spikecurtis July 8, 2024 12:23
@dannykopping dannykopping merged commit bdd2caf into main Jul 8, 2024
34 checks passed
@dannykopping dannykopping deleted the dk/system-notifications-lib branch July 8, 2024 13:38
@github-actions github-actions bot locked and limited conversation to collaborators Jul 8, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants