feat: implement observability of notifications subsystem #13799
Conversation
Force-pushed from ad7f470 to 18dbabe.
FWIW the changes LGTM in isolation (I haven't reviewed the original code).
```sql
ALTER TABLE notification_messages
    ADD COLUMN queued_seconds FLOAT NULL;
```
Nit: the name `queued_seconds` implies an integer; consider changing the type or the name.
I don't think it necessarily implies integer; `time.Duration`'s `Seconds()` returns a float64.
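(A tiny, self-contained illustration of that point; this is not code from the PR:)

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// time.Duration.Seconds() yields a float64, so sub-second precision
	// survives without any scaling tricks.
	queued := 1500 * time.Millisecond
	fmt.Println(queued.Seconds()) // 1.5
}
```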
Fair enough 😄.
In the DB, usually it means seconds:

```sql
lifetime_seconds bigint DEFAULT 86400 NOT NULL,
timeout_seconds integer NOT NULL
connection_timeout_seconds integer DEFAULT 0 NOT NULL,
```

But fine to keep it as a float if we care about the precision (although I'd prefer to multiply the number up to the required precision and store it as an int).
I suppose I don't strictly need to keep sub-second precision on the length of time messages were queued for.
I'll switch to `integer`.
👍
This is backing a Prometheus metric, and Prometheus metrics are always floats; by convention the name ends in the SI unit, which for time is seconds. It implies no precision.
This should be a float: there just isn't any compelling reason to pre-judge the required precision. This is not a banking application, nor do we need to worry about the relative speed of integer vs floating-point arithmetic.
The name `queued_seconds` is appropriate.
Nothing blocking 👍
Thanks for all the replies.
Minor refactoring to make testing easier Signed-off-by: Danny Kopping <danny@coder.com>
Force-pushed from 0e6fdae to 114797d.
coderd/notifications/metrics.go (outdated):

```go
subsystem = "notifications"

LabelMethod     = "method"
LabelTemplateID = "template_id"
```
maybe "notification_template_id" --- there is another, more user-facing concept of "template" which people might assume this refers to.
coderd/notifications/metrics.go (outdated):

```go
PermFailureCount: promauto.With(reg).NewCounterVec(prometheus.CounterOpts{
	Name: "permanent_failures_count", Namespace: ns, Subsystem: subsystem,
	Help: "The count of notifications which failed and have exceeded their retry attempts.",
}, []string{LabelMethod, LabelTemplateID}),
```
DispatchCount, TempFailureCount, and PermFailureCount might make sense as a single metric with different "disposition" (or "result") labels. The help text refers to them as "notifications", but I think that's misleading. What we are counting is "notification attempts" or "dispatch attempts", which might individually end up in success or some kind of failure. In particular, in the case of failure and retry, the number of delivery attempts can exceed the number of notifications, so it's important to make the distinction.
Prometheus guidelines say that something should be a single metric with labels (vs multiple metrics) when the sum or average of the label is sensible and coherent. Filtering by method and summing across disposition would be useful, for example, to track the number of web requests we are making via Webhook delivery.
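A rough sketch of what that consolidated metric could look like; the names and labels below are illustrative rather than what was actually merged:

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Illustrative only: one counter for all dispatch attempts, with the outcome
// recorded in a "result" label rather than in separate metrics.
func newDispatchAttempts(reg prometheus.Registerer) *prometheus.CounterVec {
	return promauto.With(reg).NewCounterVec(prometheus.CounterOpts{
		Namespace: "coderd",
		Subsystem: "notifications",
		Name:      "dispatch_attempts_total",
		Help:      "Number of dispatch attempts, by delivery method, template and result.",
	}, []string{"method", "notification_template_id", "result"})
}
```

Summing across `result` then gives total attempts per method, e.g. the number of outbound webhook requests.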
Super idea! Simplifies things a lot.
```go
PendingUpdates: promauto.With(reg).NewGauge(prometheus.GaugeOpts{
	Name: "pending_updates", Namespace: ns, Subsystem: subsystem,
	Help: "The number of updates waiting to be flushed to the store.",
}),
```
A really useful gauge is the number of "in progress" delivery attempts. You increment it when you kick off the delivery goroutine and decrement it when you send to the `success` or `failure` channels. Can help spot hung handlers/leaked goroutines.
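A minimal sketch of that pattern, with hypothetical channel and handler types:

```go
package notifier

import "github.com/prometheus/client_golang/prometheus"

// Illustrative only: count in-progress delivery attempts. The gauge is
// incremented when the delivery goroutine starts and decremented only after
// the result has been sent on the success or failure channel, so a hung
// handler shows up as a gauge that never comes back down.
func dispatch(inflight prometheus.Gauge, deliver func() error, success, failure chan<- error) {
	go func() {
		inflight.Inc()
		defer inflight.Dec()

		if err := deliver(); err != nil {
			failure <- err
			return
		}
		success <- nil
	}()
}
```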
Good idea!
coderd/notifications/metrics.go (outdated):

```go
func NewMetrics(reg prometheus.Registerer) *Metrics {
	return &Metrics{
		DispatchedCount: promauto.With(reg).NewCounterVec(prometheus.CounterOpts{
			Name: "dispatched_count", Namespace: ns, Subsystem: subsystem,
```
Accumulating counts should end in the unit and `_total`, as in `coderd_notifications_successful_deliveries_total`, where "deliveries" is the unit (could use "dispatch_attempts" or something similar).
Wow, 4 months out of Grafana Labs and I forget all my Prometheus manners.
Thanks!
coderd/notifications/metrics.go (outdated):

```go
// Currently no requirement to discriminate between success and failure updates which are pending.
PendingUpdates: promauto.With(reg).NewGauge(prometheus.GaugeOpts{
	Name: "pending_updates", Namespace: ns, Subsystem: subsystem,
	Help: "The number of updates waiting to be flushed to the store.",
```
Sounds too generic to me. "The number of delivery attempt results waiting to be flushed to the store"?
```sql
ALTER TABLE notification_messages
    ADD COLUMN queued_seconds INTEGER NULL;
```
Do we need to store this on the row? It feels a little strange and ambiguous. If the notification_message is currently queued, then it's a bit meaningless, since we don't know for how much longer it will be queued. I.e. it is at present being returned on Enqueue, but is meaningless in that context.
When we dequeue it, can't the notifier compute how long it was on the queue based on its `created_at` or `updated_at` time? Or, if we really want to ensure all the measurements are based on the central PostgreSQL server clock, then it can be part of the result returned to the notifier without having to write it back to the database.
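A sketch of that alternative (the timestamp field and function names here are assumptions, not the actual schema or code):

```go
package notifier

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Illustrative only: derive the queued duration at dequeue time from a
// timestamp already on the row, instead of persisting queued_seconds.
func observeQueuedDuration(hist prometheus.Observer, lastQueuedAt time.Time) {
	hist.Observe(time.Since(lastQueuedAt).Seconds())
}
```

(The reply below explains why `updated_at` specifically can't serve as that timestamp.)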
Yeah, I agree it's a bit awkward.
The problem is that `updated_at` is modified when the message is acquired:

```sql
-- name: AcquireNotificationMessages :many
WITH acquired AS (
    UPDATE
        notification_messages
    SET ...
        updated_at = NOW(),
```

I tried various approaches to storing the `updated_at` value prior to its being modified here, but this approach seemed simpler.
"queued" should refer to the time between attempts; I can't use `created_at` for that reason. Good call on its being returned via `EnqueueNotificationMessage`; that's definitely not meaningful.
I should probably set it to `0` again if the message is marked as failed; that way the value will always be fairly meaningful:

- `pending`: value should be 0
- `leased`/`sent`: value should be present
- `*failure`: value should be 0
```go
		}
	}
	n.metrics.PendingUpdates.Set(float64(len(success) + len(failure)))
```
I think it would be cleaner to just compute this metric on-`Collect`, rather than `Set`ting it when something reads or writes to these channels. As an illustration of how easy it is to miss a spot, note that you have forgotten to set it in `process()` after a failed call to `prepare()`.
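For reference, a Collect-time version could be as small as a `GaugeFunc` whose callback reads the channel lengths on scrape; a sketch, assuming the pending updates live in buffered channels:

```go
package notifier

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Illustrative only: the callback runs on every scrape, so there is no need
// to call Set() at each site that reads from or writes to the channels.
func registerPendingUpdates[T any](reg prometheus.Registerer, success, failure chan T) {
	promauto.With(reg).NewGaugeFunc(prometheus.GaugeOpts{
		Namespace: "coderd",
		Subsystem: "notifications",
		Name:      "pending_updates",
		Help:      "The number of updates waiting to be flushed to the store.",
	}, func() float64 {
		return float64(len(success) + len(failure))
	})
}
```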
I agree, although changing to a `Collect`-based approach isn't worth the complexity IMHO.
This metric is updated in `syncUpdates` every 2s (by default), so that's probably good enough™️ for now.
I was initially planning to give `syncUpdates` the responsibility of updating this exclusively, but since `StoreSyncInterval` is tuneable, operators could mess with the metric collection in a way that is difficult to reason about.
I'll add the missing update, thanks.
coderd/notifications/metrics.go (outdated):

```go
// Aggregating on LabelTemplateID as well would cause a cardinality explosion.
QueuedSeconds: promauto.With(reg).NewHistogramVec(prometheus.HistogramOpts{
	Name: "queued_seconds", Namespace: ns, Subsystem: subsystem,
	Buckets: []float64{1, 5, 15, 30, 60, 120, 300, 600, 3600, 86400},
```
We say this should be less than the fetch interval, which is 15s --- so I think there should be more buckets <15s. In fact, it should be more like 15/(n+1) where n is the number of Coderd replicas, assuming that notifications usually get picked up by the next fetch. Seeing the average move up toward 15s is way more important than the distinction between 10minutes, 1 hour and 1 day, all of which indicate that notifications are just blocked entirely.
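A sketch of a layout weighted toward the sub-15s range; the values are illustrative, not necessarily what was merged:

```go
package metrics

// Illustrative only: finer resolution below the default 15s fetch interval,
// coarser above it, where the exact value matters much less.
var queuedSecondsBuckets = []float64{1, 2.5, 5, 7.5, 10, 15, 30, 60, 300, 600, 3600, 86400}
```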
Fair point.
```go
// Aggregating on LabelTemplateID as well would cause a cardinality explosion.
DispatcherSendSeconds: promauto.With(reg).NewHistogramVec(prometheus.HistogramOpts{
	Name: "dispatcher_send_seconds", Namespace: ns, Subsystem: subsystem,
	Buckets: []float64{0.001, 0.05, 0.1, 0.5, 1, 2, 5, 10, 15, 30, 60, 120},
```
Should be `0.001, 0.005, 0.01, 0.05` on the bottom end, right?
I don't see much value in distinguishing between 1ms, 5ms, and 10ms; for most systems the latency will be higher than that (I'm assuming). The high end is what we're more concerned about here since that's indicative of a problem. In other words: this metric is more for diagnosing issues than measuring performance of the endpoint/server.
Dispatching, say, a webhook, within the same cluster could be a few milliseconds latency. Same geographic region could be 10-20ms.
Almost nothing is going to be 1ms unless it's extremely simple and extremely close, but in 50ms you could be dispatching across an entire continent. With buckets like this you could have something that normally returns in 15ms take 3x its normal time and be close to falling over and this metric would not clock it. That seems much more important than 30s, 60s, 120s.
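The suggested bottom end would look something like this (illustrative values only, not what was merged):

```go
package metrics

// Illustrative only: extra low-end buckets so that a dispatch that normally
// completes in ~15ms still shifts buckets when it takes 3x its usual time.
var dispatcherSendSecondsBuckets = []float64{0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 2, 5, 10, 15, 30, 60, 120}
```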
I hear you and I can see where you're coming from.
Let's wait for users to ask for this. I think there's sufficient coverage for now, and the additional datapoints are likely not going to be very useful for the majority of operators IMHO.
```go
PendingUpdates: promauto.With(reg).NewGauge(prometheus.GaugeOpts{
	Name: "pending_updates", Namespace: ns, Subsystem: subsystem,
	Help: "The number of updates waiting to be flushed to the store.",
}),
```
Also a count of success/failure updates flushed to the database.
Changes made to dbgen are to avoid this panic:

```
$ ./coderd/database/generate.sh generate
panic: unknown return type: *dst.SelectorExpr

goroutine 1 [running]:
main.orderAndStubDatabaseFunctions({0x1400001ea80, 0x3c}, {0x102d019e0, 0x1}, {0x102d0bc75, 0xb}, 0x102df7b20)
	/Users/danny/Code/coder/coder/scripts/dbgen/main.go:367 +0x2588
main.run()
	/Users/danny/Code/coder/coder/scripts/dbgen/main.go:56 +0x118
main.main()
	/Users/danny/Code/coder/coder/scripts/dbgen/main.go:42 +0x1c
exit status 2
```

Signed-off-by: Danny Kopping <danny@coder.com>
@@ -215,24 +234,55 @@ func (n *notifier) deliver(ctx context.Context, msg database.AcquireNotification

```go
case <-ctx.Done():
	logger.Warn(context.Background(), "cannot record dispatch failure result", slog.Error(ctx.Err()))
	return ctx.Err()
default:
case failure <- n.newFailedDispatch(msg, err, retryable):
```
Addresses #13537 (comment)
👍
Force-pushed from 44a0e9f to 15c4537.
Stamp 👍
Closes coder/internal#3
Also performed some drive-by refactorings.
NOTE: this is stacked on `dk/system-notifications-lib`, but I didn't want to use Graphite for fear it would screw up that branch, which is close to being merged; I'm still pretty new to that tool.