feat: Collect agent SSH metrics #7584
Conversation
I'm wary of using global state for this. We originally did it this way with promauto in V1 and ripped it out later: https://github.com/coder/v1/issues/13357

Global state is basically impossible to test and can create unintended side effects just by importing packages. I realize we're already a bit wed to tailscale.com/util/clientmetric in order to get metrics out of tailscale, but I don't think we should add to the problem.
As an alternative, we could attach a prometheus registry to the agent and register the metrics there, and then collect them to ship to coderd.
I admit that was the reason I decided to expose agent metrics this way. I don't mind enabling a custom Prometheus registry in the agent to collect these metrics; it just requires more implementation on our side, which is why I initially stuck with the simpler approach. I will try to refactor this to use the registry 👍.

I refactored the source to use PrometheusRegistry. It required implementing a custom handler to merge Tailscale and agent metrics. Let me know your thoughts; in the meantime, I will proceed with testing. Unfortunately, the size of the PR jumped from ~160 to ~450 LOC. I guess I could limit that by adding a wrapper.
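For readers following along, here is a minimal, hypothetical sketch of the registry-based approach being discussed: the agent owns a prometheus.Registry, registers its collectors there, and gathers them to ship to coderd. The metric name and the shipping step are assumptions for illustration, not the PR's actual code; merging Tailscale's clientmetric output would additionally require the custom handler/gatherer mentioned above.

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// A registry owned by the agent, instead of global state such as
	// promauto's default registry or clientmetric's package-level vars.
	reg := prometheus.NewRegistry()

	// Hypothetical counter name, for illustration only.
	sshSessions := prometheus.NewCounter(prometheus.CounterOpts{
		Namespace: "agent",
		Subsystem: "ssh",
		Name:      "sessions_total",
	})
	reg.MustRegister(sshSessions)

	sshSessions.Inc()

	// Gather returns the registered metric families; the agent could
	// encode these and ship them to coderd on an interval. Tailscale's
	// clientmetric output would be merged in by a custom handler.
	mfs, err := reg.Gather()
	if err != nil {
		panic(err)
	}
	for _, mf := range mfs {
		fmt.Printf("%s: %d series\n", mf.GetName(), len(mf.GetMetric()))
	}
}
```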
agent/agentssh/agentssh.go (outdated)

	var exitError *exec.ExitError
	if xerrors.As(err, &exitError) {
-		s.logger.Debug(ctx, "ssh session returned", slog.Error(exitError))
+		s.logger.Warn(ctx, "ssh session returned", slog.Error(exitError))
+		m.sessionError.Add(1)
This construction where we increment error metrics both here and down the stack where they occur means that we count the same error in more than one metric. This makes sums over the error metrics very misleading.
A better design is to only increment the metrics here in this function by selecting on the type of error returned by s.sessionStart.
You could create an error wrapper, e.g.:

type metricError struct {
	err    error
	labels prometheus.Labels
}

func (m metricError) Error() string {
	return m.Unwrap().Error()
}

func (m metricError) Unwrap() error {
	return m.err
}
Then use xerrors.As to check for this type and increment accordingly. If we forget to wrap some errors, we can have an "unknown" error type that we increment in that case, which could signal us to go back and wrap the missing error.
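A self-contained sketch of how that dispatch might look; sessionStart, the counter name, and the label values here are stand-ins for illustration, not the PR's actual identifiers.

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
	"golang.org/x/xerrors"
)

// metricError wraps an error with the Prometheus labels it should be
// counted under, as suggested above.
type metricError struct {
	err    error
	labels prometheus.Labels
}

func (m metricError) Error() string { return m.err.Error() }
func (m metricError) Unwrap() error { return m.err }

// sessionErrors stands in for the agent's real counter; the metric and
// label names are assumptions.
var sessionErrors = prometheus.NewCounterVec(prometheus.CounterOpts{
	Name: "agent_session_errors_total",
}, []string{"error_type"})

// sessionStart is a stand-in for s.sessionStart; deep in the stack the
// error is wrapped with its labels instead of incrementing a counter there.
func sessionStart() error {
	return metricError{
		err:    xerrors.New("pty start failed"),
		labels: prometheus.Labels{"error_type": "pty_start"},
	}
}

func main() {
	err := sessionStart()

	// Increment exactly once, here at the top of the stack.
	var me metricError
	switch {
	case err == nil:
		// success: nothing to count
	case xerrors.As(err, &me):
		sessionErrors.With(me.labels).Inc()
		fmt.Println("counted as:", me.labels["error_type"])
	default:
		// An unwrapped error means someone forgot a metricError; count it
		// as "unknown" so it stays visible and can be fixed later.
		sessionErrors.WithLabelValues("unknown").Inc()
		fmt.Println("counted as: unknown")
	}
}
```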
> This construction where we increment error metrics both here and down the stack where they occur means that we count the same error in more than one metric.

I disagree: having a generic metric that indicates a session error happened makes it much easier to set up alerting. You don't need to watch every single metric (and eventually miss newly added ones); you just have one to depend on.
In Prometheus it's easy to query for the sum across the labels of an error metric. There's no chance of missing one if more label values are added later.
agent/agentssh/agentssh.go (outdated)

-	_, _ = io.Copy(ptty.InputWriter(), session)
+	_, err := io.Copy(ptty.InputWriter(), session)
+	if err != nil {
+		m.ptyInputIoCopyError.Add(1)
Note that my earlier comment about wrapping errors and incrementing the metrics higher in the stack does not apply to these errors, which occur in separate goroutines: there is nothing higher in the stack to handle incrementing metrics, and in a real sense each is a distinct error from any errors hit on the main goroutine, so we're not double-counting by incrementing here.
@spikecurtis I refactored the source code to use metric labels. Looking at the code now (although it is still a draft), I'm not convinced about the error-wrapping technique. Let me challenge it. There are cases where we'd like to collect more than one error, and trying to aggregate a few of them requires extra logic. Consider the following code:

if !isQuietLogin(session.RawCommand()) {
	manifest := s.Manifest.Load()
	if manifest != nil {
		err := showMOTD(session, manifest.MOTDFile)
		if err != nil {
			s.logger.Error(ctx, "show MOTD", slog.Error(err))
			s.metrics.sessionErrors.WithLabelValues(magicTypeLabel, "yes", "motd").Add(1)
		}
	} else {
		s.logger.Warn(ctx, "metadata lookup failed, unable to show MOTD")
	}
}

We'd like to know that something went wrong with MOTD, but that doesn't mean the session gets terminated instantly. If something bad happens to the session later, the code would have to bubble up a sum of errors. Personally, I don't see a big gain here compared to a simple intercepting one-liner:

s.metrics.sessionErrors.WithLabelValues(magicTypeLabel, "yes", "motd").Add(1)
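For context, a rough sketch of how a three-label sessionErrors counter could be declared so that the WithLabelValues(magicTypeLabel, "yes", "motd") call above lines up; the metric and label names are guesses, not the PR's actual definitions.

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/testutil"
)

func main() {
	reg := prometheus.NewRegistry()

	// Three labels to match WithLabelValues(magicTypeLabel, "yes", "motd");
	// the names are assumptions for illustration.
	sessionErrors := prometheus.NewCounterVec(prometheus.CounterOpts{
		Namespace: "agent",
		Subsystem: "sessions",
		Name:      "errors_total",
	}, []string{"magic_type", "pty", "error_type"})
	reg.MustRegister(sessionErrors)

	// Non-fatal error path: count it and keep the session going,
	// as in the MOTD example above.
	sessionErrors.WithLabelValues("ssh", "yes", "motd").Add(1)

	fmt.Println(testutil.ToFloat64(sessionErrors.WithLabelValues("ssh", "yes", "motd"))) // 1
}
```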
I only have nits, but will defer to others for approval.
@@ -1724,7 +1726,7 @@ func (c closeFunc) Close() error {
 	return c()
 }
 
-func setupAgent(t *testing.T, metadata agentsdk.Manifest, ptyTimeout time.Duration) (
+func setupAgent(t *testing.T, metadata agentsdk.Manifest, ptyTimeout time.Duration, opts ...func(agent.Options) agent.Options) (
Suggestion: remove ptyTimeout and set a default value in the options (see the sketch below).
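A hypothetical sketch of that suggestion, with invented names rather than the test helper's real types: fold the PTY timeout into the variadic options with a default, so the existing call sites don't need a new argument.

```go
package main

import (
	"fmt"
	"time"
)

// Options mirrors the shape of agent.Options for illustration only.
type Options struct {
	ReconnectingPTYTimeout time.Duration
}

// Option matches the opts ...func(agent.Options) agent.Options shape above.
type Option func(Options) Options

// WithPTYTimeout overrides the default timeout.
func WithPTYTimeout(d time.Duration) Option {
	return func(o Options) Options {
		o.ReconnectingPTYTimeout = d
		return o
	}
}

// setupAgent applies a default first, then any caller overrides.
func setupAgent(opts ...Option) Options {
	o := Options{ReconnectingPTYTimeout: time.Minute} // assumed default
	for _, opt := range opts {
		o = opt(o)
	}
	return o
}

func main() {
	fmt.Println(setupAgent().ReconnectingPTYTimeout)                            // 1m0s
	fmt.Println(setupAgent(WithPTYTimeout(time.Second)).ReconnectingPTYTimeout) // 1s
}
```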
Maybe I will push this in a separate PR. That function is used in ~30 places.
That's fair. The high-level goal is that each legitimately distinct error is counted exactly once. For errors that don't terminate the session, I think it's fine to just directly increment the error counter and move on. What I want to avoid, though, is double-counting, where an error type is incremented and then a "generic error" count is also incremented after the function returns. We could limit the error wrapping to just those errors that trigger an immediate return in the session handler. The feature I really like about error wrapping is that it affords us the ability to ensure we really are counting all the session-ending errors, not just the ones we remembered to count.
I see your point, but considering that we have to increment the error counter in two different ways, the argument doesn't convince me.

It is a theoretical case, as most crucial code paths fail fast and return (the connection is terminated). There is a low chance that the same counter is incremented twice.

This is a valid point, but in my opinion it complicates the code logic too much; I suppose this is a matter of personal preference. I'm wondering if we can push this change to a follow-up iteration to deliver the feature first, as what you're suggesting here is pure refactoring. Nevertheless, I will start doing it now 👍.
I don't mean a race condition; I just mean counting an error at two different places in the stack. Your earlier PR incremented metrics both down the stack where the error occurred and again in the session handler.
Name   string             `json:"name" validate:"required"`
Type   AgentMetricType    `json:"type" validate:"required" enums:"counter,gauge"`
Value  float64            `json:"value" validate:"required"`
Labels []AgentMetricLabel `json:"labels,omitempty"`
Labels are key-value pairs; why not just use a map[string]string?
I prefer using slices over maps to keep the order of elements in transit. I didn't want to sort them specifically on the receiver side.
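A small sketch illustrating the trade-off: a slice of label structs keeps whatever order the sender emits, while a map[string]string is marshalled by encoding/json with keys sorted and has no defined iteration order on the receiver. The field names below mirror the struct shown earlier but are otherwise assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// AgentMetricLabel as a name/value pair; field names assumed for illustration.
type AgentMetricLabel struct {
	Name  string `json:"name"`
	Value string `json:"value"`
}

type AgentMetric struct {
	Name   string             `json:"name"`
	Type   string             `json:"type"`
	Value  float64            `json:"value"`
	Labels []AgentMetricLabel `json:"labels,omitempty"`
}

func main() {
	m := AgentMetric{
		Name:  "agent_sessions_total",
		Type:  "counter",
		Value: 3,
		// The slice preserves this order in transit.
		Labels: []AgentMetricLabel{
			{Name: "magic_type", Value: "ssh"},
			{Name: "pty", Value: "yes"},
		},
	}
	b, _ := json.Marshal(m)
	fmt.Println(string(b))

	// A map would be re-ordered: encoding/json sorts map keys when marshalling.
	asMap := map[string]string{"pty": "yes", "magic_type": "ssh"}
	b2, _ := json.Marshal(asMap)
	fmt.Println(string(b2)) // {"magic_type":"ssh","pty":"yes"}
}
```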
LGTM!
Related: #6724
This PR adds more metrics to the SSH agent. These should help with identifying connectivity issues.
Actually, it is similar to the metrics support in tailssh. Challenged by Spike, I switched to using metric labels.

Sample metrics: