
feat: implement observability of notifications subsystem #13799


Merged
27 commits merged into main from dk/system-notifications-o11y on Jul 11, 2024

Conversation

Contributor

@dannykopping commented Jul 8, 2024

Closes coder/internal#3

Also performed some drive-by refactorings.

NOTE: this is stacked on `dk/system-notifications-lib`, but I didn't want to use Graphite for fear it screws up that branch, which is close to being merged; I'm still pretty new to that tool.

@dannykopping force-pushed the dk/system-notifications-o11y branch 3 times, most recently from ad7f470 to 18dbabe on July 8, 2024 08:47
@dannykopping requested review from mtojek, matifali, mafredri and spikecurtis and removed the request for matifali July 8, 2024 08:55
@dannykopping marked this pull request as ready for review July 8, 2024 08:55
@dannykopping marked this pull request as draft July 8, 2024 09:03
Member

@mafredri left a comment

FWIW the changes LGTM in isolation (I haven't reviewed the original code).

@@ -0,0 +1,2 @@
ALTER TABLE notification_messages
ADD COLUMN queued_seconds FLOAT NULL;
Member

Nit: the name queued_seconds implies an integer; consider changing the type or the name.

Contributor Author

I don't think it necessarily implies an integer; time.Duration's Seconds() returns a float64.

Member

@mafredri Jul 8, 2024

Fair enough 😄.

In the DB, the _seconds suffix usually means whole seconds:

    lifetime_seconds bigint DEFAULT 86400 NOT NULL,
    timeout_seconds integer NOT NULL
    connection_timeout_seconds integer DEFAULT 0 NOT NULL,

But it's fine to keep it as a float if we care about the precision (although I'd prefer to multiply the value up to the required precision and store it as an integer).

Contributor Author

I suppose I don't strictly need to keep sub-second precision on the length of time messages were queued for.
I'll switch to integer 👍

Contributor

This is backing a Prometheus metric; metrics are always floats, and by convention their names end in the SI unit, which for time is seconds. It implies no particular precision.

This should be a float: there just isn't any compelling reason to pre-judge the required precision. This is not a banking application, nor do we need to worry about the relative speed of integer vs floating point arithmetic.

The name queued_seconds is appropriate.

Member

@mtojek left a comment

Nothing blocking 👍

Member

@mtojek left a comment

Thanks for all replies.

Base automatically changed from dk/system-notifications-lib to main July 8, 2024 13:38
dannykopping and others added 8 commits July 8, 2024 15:54
Minor refactoring to make testing easier

Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
Co-authored-by: Mathias Fredriksson <mafredri@gmail.com>
Signed-off-by: Danny Kopping <danny@coder.com>
@dannykopping force-pushed the dk/system-notifications-o11y branch from 0e6fdae to 114797d on July 8, 2024 13:55
@dannykopping marked this pull request as ready for review July 8, 2024 13:55
subsystem = "notifications"

LabelMethod = "method"
LabelTemplateID = "template_id"
Contributor

maybe "notification_template_id" --- there is another, more user-facing concept of "template" which people might assume this refers to.

PermFailureCount: promauto.With(reg).NewCounterVec(prometheus.CounterOpts{
Name: "permanent_failures_count", Namespace: ns, Subsystem: subsystem,
Help: "The count of notifications which failed and have exceeded their retry attempts.",
}, []string{LabelMethod, LabelTemplateID}),
Contributor

DispatchCount, TempFailureCount, and PermFailureCount might make sense as a single metric with different "disposition" (or "result") labels. The help text refers to them as "notifications", but I think that's misleading. What we are counting is "notification attempts" or "dispatch attempts", which might individually end up in success or some kind of failure. In particular, in the case of failure and retry, the number of delivery attempts can exceed the number of notifications, so it's important to make the distinction.

Prometheus guidelines say that something should be a single metric with labels (vs multiple metrics) when the sum or average of the label is sensible and coherent. Filtering by method and summing across disposition would be useful, for example, to track the number of web requests we are making via Webhook delivery.
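(For illustration, a minimal Go sketch of the single consolidated metric being suggested; the metric name dispatch_attempts_total and the result label values are assumptions for the example, not names taken from this PR.)

package metrics

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

const (
    LabelMethod     = "method"
    LabelTemplateID = "template_id"
    LabelResult     = "result" // e.g. "success", "temp_failure", "perm_failure" (assumed values)
)

// newDispatchAttempts registers one counter in place of separate
// success/temp-failure/perm-failure counters; summing over the result
// label yields total dispatch attempts per method.
func newDispatchAttempts(reg prometheus.Registerer) *prometheus.CounterVec {
    return promauto.With(reg).NewCounterVec(prometheus.CounterOpts{
        Namespace: "coderd", Subsystem: "notifications",
        Name: "dispatch_attempts_total",
        Help: "The number of dispatch attempts, by delivery method, template and result.",
    }, []string{LabelMethod, LabelTemplateID, LabelResult})
}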

Contributor Author

Super idea! Simplifies things a lot.

PendingUpdates: promauto.With(reg).NewGauge(prometheus.GaugeOpts{
Name: "pending_updates", Namespace: ns, Subsystem: subsystem,
Help: "The number of updates waiting to be flushed to the store.",
}),
Contributor

A really useful Gauge is the number of "in progress" delivery attempts. You increment it when you kick off the delivery goroutine and decrement it when you send to the success or failure channels. This can help spot hung handlers and leaked goroutines.
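(A rough Go sketch of that pattern; the function names, the metric name inflight_dispatches, and the dispatch callback are assumptions for the example, not code from this PR.)

package metrics

import (
    "context"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

// newInflightGauge registers a gauge that is incremented when a delivery
// goroutine starts and decremented once its result has been reported; a
// value that never drops back down points at a hung handler or leaked goroutine.
func newInflightGauge(reg prometheus.Registerer) prometheus.Gauge {
    return promauto.With(reg).NewGauge(prometheus.GaugeOpts{
        Namespace: "coderd", Subsystem: "notifications",
        Name: "inflight_dispatches",
        Help: "The number of dispatch attempts currently in progress.",
    })
}

// dispatchWithTracking shows the intended usage: the gauge brackets the
// entire lifetime of the (hypothetical) dispatch callback.
func dispatchWithTracking(ctx context.Context, inflight prometheus.Gauge, dispatch func(context.Context) error) {
    go func() {
        inflight.Inc()
        defer inflight.Dec()
        _ = dispatch(ctx) // success/failure reporting elided in this sketch
    }()
}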

Contributor Author

Good idea!

func NewMetrics(reg prometheus.Registerer) *Metrics {
return &Metrics{
DispatchedCount: promauto.With(reg).NewCounterVec(prometheus.CounterOpts{
Name: "dispatched_count", Namespace: ns, Subsystem: subsystem,
Contributor

Accumulating counters should end in the unit and _total, as in coderd_notifications_successful_deliveries_total, where "deliveries" is the unit (could use "dispatch_attempts" or something similar).

https://prometheus.io/docs/practices/naming/
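(A tiny sketch of that convention, using the example name given above; illustrative only, not code from this PR.)

package metrics

import "github.com/prometheus/client_golang/prometheus"

// The counter carries its unit ("deliveries") and ends in _total, so the
// fully-qualified name becomes coderd_notifications_successful_deliveries_total.
var successfulDeliveries = prometheus.NewCounter(prometheus.CounterOpts{
    Namespace: "coderd", Subsystem: "notifications",
    Name: "successful_deliveries_total",
    Help: "The number of successful notification deliveries.",
})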

Contributor Author

Wow, 4 months out of Grafana Labs and I forget all my Prometheus manners.
Thanks!

// Currently no requirement to discriminate between success and failure updates which are pending.
PendingUpdates: promauto.With(reg).NewGauge(prometheus.GaugeOpts{
Name: "pending_updates", Namespace: ns, Subsystem: subsystem,
Help: "The number of updates waiting to be flushed to the store.",
Contributor

Sounds too generic to me. "The number of delivery attempt results waiting to be flushed to the store" ?

@@ -0,0 +1,2 @@
ALTER TABLE notification_messages
ADD COLUMN queued_seconds INTEGER NULL;
Contributor

@spikecurtis Jul 8, 2024

Do we need to store this on the row? It feels a little strange and ambiguous. If the notification_message is currently queued, then it's a bit meaningless, since we don't know for how much longer it will be queued. I.e. it is at present being returned on Enqueue, but is meaningless in that context.

When we dequeue it, can't the notifier compute how long it was on the queue based on its created_at or updated_at time? Or, if we really want to ensure all the measurements are based on the central PostgreSQL server clock, then it can be part of the result returned to the notifier without having to write it back to the database.

Contributor Author

Yeah I agree it's a bit awkward.

The problem is that updated_at is modified when the message is acquired:

-- name: AcquireNotificationMessages :many
WITH acquired AS (
    UPDATE
        notification_messages
            SET ...
                updated_at = NOW(),

I tried various approaches to storing the updated_at value prior to its being modified here, but this approach seemed simpler.

"queued" should refer to the time between attempts; I can't use created_at for that reason. Good call on its being returned via EnqueueNotificationMessage; that's definitely not meaningful.

I should probably set it to 0 again if the message is marked as failed; that way the value will always be fairly meaningful:

  • pending: value should be 0
  • leased/sent: value should be present
  • *failure: value should be 0

}
}
n.metrics.PendingUpdates.Set(float64(len(success) + len(failure)))
Contributor

I think it would be cleaner to just compute this metric on Collect, rather than Setting it when something reads or writes to these channels. As an illustration of how easy it is to miss a spot, note that you have forgotten to set it in process() after a failed call to prepare().
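(A Go sketch of the Collect-time alternative being described, assuming a hypothetical pendingUpdates() accessor on the notifier that returns the number of buffered results; names are illustrative, not from this PR.)

package metrics

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

// registerPendingUpdates exposes the queue depth via a GaugeFunc, which is
// evaluated on every scrape, so no Set() calls need to be scattered around
// the channel reads and writes.
func registerPendingUpdates(reg prometheus.Registerer, pendingUpdates func() int) {
    promauto.With(reg).NewGaugeFunc(prometheus.GaugeOpts{
        Namespace: "coderd", Subsystem: "notifications",
        Name: "pending_updates",
        Help: "The number of delivery attempt results waiting to be flushed to the store.",
    }, func() float64 {
        return float64(pendingUpdates())
    })
}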

Contributor Author

I agree, although changing to a Collect-based approach isn't worth the complexity IMHO.
This metric is updated in syncUpdates every 2s (by default), so that's probably good enough™️ for now.

I was initially planning to give syncUpdates the responsibility of updating this exclusively, but since StoreSyncInterval is tuneable, operators could mess with the metric collection in a way that is difficult to reason about.

I'll add the missing update, thanks.

// Aggregating on LabelTemplateID as well would cause a cardinality explosion.
QueuedSeconds: promauto.With(reg).NewHistogramVec(prometheus.HistogramOpts{
Name: "queued_seconds", Namespace: ns, Subsystem: subsystem,
Buckets: []float64{1, 5, 15, 30, 60, 120, 300, 600, 3600, 86400},
Contributor

We say this should be less than the fetch interval, which is 15s --- so I think there should be more buckets <15s. In fact, it should be more like 15/(n+1) where n is the number of Coderd replicas, assuming that notifications usually get picked up by the next fetch. Seeing the average move up toward 15s is way more important than the distinction between 10minutes, 1 hour and 1 day, all of which indicate that notifications are just blocked entirely.
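(For example, a bucket layout weighted toward the sub-15s range; the values are illustrative, assume the default 15s fetch interval, and are not the buckets adopted in this PR.)

package metrics

// Illustrative only: finer resolution below the 15s fetch interval, coarser above it.
var queuedSecondsBuckets = []float64{1, 2.5, 5, 7.5, 10, 15, 30, 60, 300, 3600, 86400}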

Contributor Author

Fair point.

// Aggregating on LabelTemplateID as well would cause a cardinality explosion.
DispatcherSendSeconds: promauto.With(reg).NewHistogramVec(prometheus.HistogramOpts{
Name: "dispatcher_send_seconds", Namespace: ns, Subsystem: subsystem,
Buckets: []float64{0.001, 0.05, 0.1, 0.5, 1, 2, 5, 10, 15, 30, 60, 120},
Contributor

should be 0.001, 0.005, 0.01, 0.05 on the bottom end, right?

Contributor Author

I don't see much value in distinguishing between 1ms, 5ms, and 10ms; for most systems the latency will be higher than that (I'm assuming). The high end is what we're more concerned about here since that's indicative of a problem. In other words: this metric is more for diagnosing issues than measuring performance of the endpoint/server.

Contributor

Dispatching, say, a webhook, within the same cluster could be a few milliseconds latency. Same geographic region could be 10-20ms.

Almost nothing is going to be 1ms unless it's extremely simple and extremely close, but in 50ms you could be dispatching across an entire continent. With buckets like this you could have something that normally returns in 15ms take 3x its normal time and be close to falling over and this metric would not clock it. That seems much more important than 30s, 60s, 120s.

Contributor Author

I hear you and I can see where you're coming from.

Let's wait for users to ask for this. I think there's sufficient coverage for now, and the additional datapoints are likely not going to be very useful for the majority of operators IMHO.

PendingUpdates: promauto.With(reg).NewGauge(prometheus.GaugeOpts{
Name: "pending_updates", Namespace: ns, Subsystem: subsystem,
Help: "The number of updates waiting to be flushed to the store.",
}),
Contributor

Also a count of success/failure updates flushed to the database.

Changes made to dbgen are to avoid this panic:

$ ./coderd/database/generate.sh
generate
panic: unknown return type: *dst.SelectorExpr

goroutine 1 [running]:
main.orderAndStubDatabaseFunctions({0x1400001ea80, 0x3c}, {0x102d019e0, 0x1}, {0x102d0bc75, 0xb}, 0x102df7b20)
        /Users/danny/Code/coder/coder/scripts/dbgen/main.go:367 +0x2588
main.run()
        /Users/danny/Code/coder/coder/scripts/dbgen/main.go:56 +0x118
main.main()
        /Users/danny/Code/coder/coder/scripts/dbgen/main.go:42 +0x1c
exit status 2

Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
@@ -215,24 +234,55 @@ func (n *notifier) deliver(ctx context.Context, msg database.AcquireNotification
case <-ctx.Done():
logger.Warn(context.Background(), "cannot record dispatch failure result", slog.Error(ctx.Err()))
return ctx.Err()
default:
case failure <- n.newFailedDispatch(msg, err, retryable):
Contributor Author

Addresses #13537 (comment)

Signed-off-by: Danny Kopping <danny@coder.com>
Member

@mtojek left a comment

👍

Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
Member

@mtojek left a comment

Stamp 👍

@dannykopping merged commit b2dab33 into main Jul 11, 2024
28 checks passed
@dannykopping deleted the dk/system-notifications-o11y branch July 11, 2024 08:57
@github-actions bot locked and limited conversation to collaborators Jul 11, 2024