feat: Collect agent SSH metrics #7584

Merged: 42 commits into coder:main (May 25, 2023)

Conversation

@mtojek (Member) commented May 17, 2023

Related: #6724

This PR adds more metrics to the SSH agent. These should help with identifying connectivity issues.

It is similar to the metrics support in tailssh; I was challenged by Spike to use metric labels.

Sample metrics:

agent_sessions_total{agent_name="main",magic_type="ssh",pty="yes",username="admin",workspace_name="docker-1"} 5
agent_sessions_total{agent_name="main",magic_type="vscode",pty="no",username="admin",workspace_name="docker-1"} 1

@mtojek mtojek marked this pull request as ready for review May 18, 2023 11:16
@mtojek mtojek requested a review from johnstcn May 18, 2023 11:17
@mtojek mtojek requested a review from spikecurtis May 18, 2023 11:20
@spikecurtis (Contributor) left a comment

I'm wary of using global state for this.

We originally did it this way with promauto in V1 and ripped it out later.

https://github.com/coder/v1/issues/13357

Global state is basically impossible to test and can create unintended side effects just by importing packages.

I realize we're already a bit wed to tailscale.com/util/clientmetric in order to get metrics out of tailscale, but I don't think we should add to the problem.

As an alternative, we could attach a prometheus registry to the agent and register the metrics there, and then collect them to ship to coderd.
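
A minimal sketch of that registry approach, with hypothetical names (agentMetrics, newAgentMetrics) rather than the actual implementation in this PR:

package agent

import (
	"github.com/prometheus/client_golang/prometheus"
	dto "github.com/prometheus/client_model/go"
)

// agentMetrics holds a per-agent registry instead of the global default one.
type agentMetrics struct {
	registry      *prometheus.Registry
	sessionsTotal *prometheus.CounterVec
}

func newAgentMetrics() *agentMetrics {
	reg := prometheus.NewRegistry()
	sessions := prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "agent_sessions_total",
		Help: "Total number of SSH sessions handled by the agent.",
	}, []string{"magic_type", "pty"})
	reg.MustRegister(sessions)
	return &agentMetrics{registry: reg, sessionsTotal: sessions}
}

// gather snapshots the registered metrics so they can be attached to the
// stats payload shipped to coderd.
func (m *agentMetrics) gather() ([]*dto.MetricFamily, error) {
	return m.registry.Gather()
}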

@mtojek (Member, Author) commented May 18, 2023

I realize we're already a bit wed to tailscale.com/util/clientmetric in order to get metrics out of tailscale, but I don't think we should add to the problem.

I admit that it was the reason why I decided to expose agent metrics this way. There is a function called collectMetrics(), defined in metrics.go, that reviews tailscale clientmetrics and attaches them to the Stats object.

I don't mind enabling a custom Prometheus registry in the agent to collect these metrics. It just requires more implementation on our side, which is what convinced me to stick with tailscale.com/util/clientmetric.

I will try to refactor this to use the registry 👍 .

@mtojek (Member, Author) commented May 18, 2023

I refactored the source to use PrometheusRegistry. It resulted in implementing a custom handler to merge Tailscale and agent metrics. Let me know your thoughts, and in the meantime, I will proceed with testing.

Unfortunately, the size of the PR jumped from ~160 to ~450 LOC. I guess I can limit it by providing a wrapper around prometheus.NewCounter + registerer.MustRegister.
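
Such a wrapper could be as small as the sketch below (the helper name is hypothetical); the promauto.With(registry) factory in the Prometheus client library offers essentially the same convenience without touching the global default registry:

// registerCounterVec is a hypothetical promauto-style helper: it creates a
// counter vector and registers it on the given (non-global) registerer in
// one call.
func registerCounterVec(reg prometheus.Registerer, opts prometheus.CounterOpts, labelNames ...string) *prometheus.CounterVec {
	cv := prometheus.NewCounterVec(opts, labelNames)
	reg.MustRegister(cv)
	return cv
}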

var exitError *exec.ExitError
if xerrors.As(err, &exitError) {
s.logger.Debug(ctx, "ssh session returned", slog.Error(exitError))
s.logger.Warn(ctx, "ssh session returned", slog.Error(exitError))
m.sessionError.Add(1)
Contributor commented:

This construction where we increment error metrics both here and down the stack where they occur means that we count the same error in more than one metric. This makes sums over the error metrics very misleading.

A better design is to only increment the metrics here in this function by selecting on the type of error returned by s.sessionStart.

You could create an error wrapper e.g.

// metricError wraps an error together with the Prometheus labels that
// should be incremented when it is counted.
type metricError struct {
	err    error
	labels prometheus.Labels
}

func (m metricError) Error() string {
	return m.Unwrap().Error()
}

func (m metricError) Unwrap() error {
	return m.err
}

Then use xerrors.As to check for this type and increment accordingly. If we forget to wrap some errors, we can have an "unknown" error type that we increment in that case, which would signal us to go back and wrap the missing error.
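
To illustrate, consuming that wrapper could look roughly like this; the function name is made up, and sessionErrors is assumed (for simplicity) to be a CounterVec keyed by a single "error_type" label:

// incrementSessionError counts a session-ending error exactly once. Errors
// wrapped as metricError carry their own labels; anything else falls into
// an "unknown" bucket that signals a missing wrap.
func incrementSessionError(sessionErrors *prometheus.CounterVec, err error) {
	var me metricError
	if xerrors.As(err, &me) {
		sessionErrors.With(me.labels).Inc()
		return
	}
	sessionErrors.With(prometheus.Labels{"error_type": "unknown"}).Inc()
}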

Member (Author) commented:

This construction where we increment error metrics both here and down the stack where they occur means that we count the same error in more than one metric.

I have to disagree: having a generic metric indicating that a session error happened makes it much easier to set up alerting. You don't need to watch every single metric (and eventually miss newly added ones); you can just depend on one.

Contributor commented:

In Prometheus it's easy to query for the sum across the labels of an error metric, so there's no chance of missing one if more label values are added later.

_, _ = io.Copy(ptty.InputWriter(), session)
_, err := io.Copy(ptty.InputWriter(), session)
if err != nil {
m.ptyInputIoCopyError.Add(1)
Contributor commented:

Note that my earlier comment about wrapping errors and incrementing the metrics higher in the stack does not apply to these errors, which occur in separate goroutines---there is nothing higher in the stack to increment the metrics, and in a real sense each is a distinct error from any error hit on the main goroutine, so we're not double-counting by incrementing here.

@mtojek (Member, Author) commented May 23, 2023

@spikecurtis I refactored the source code to use metric labels. Looking at the code now (although it is still a draft), I'm not convinced about the error-wrapping technique.

Let me challenge it. There are cases where we'd like to collect more than one error, and trying to aggregate a few of them requires extra logic (at least an append operation).

Consider the following code:

	if !isQuietLogin(session.RawCommand()) {
		manifest := s.Manifest.Load()
		if manifest != nil {
			err := showMOTD(session, manifest.MOTDFile)
			if err != nil {
				s.logger.Error(ctx, "show MOTD", slog.Error(err))
				s.metrics.sessionErrors.WithLabelValues(magicTypeLabel, "yes", "motd").Add(1)
			}
		} else {
			s.logger.Warn(ctx, "metadata lookup failed, unable to show MOTD")
		}
	}

We'd like to know that something went wrong with MOTD, but it doesn't mean that the session gets terminated instantly. If something bad happens to the session later, the code would have to bubble up an aggregate of errors. Personally, I don't see a big gain here compared to a simple intercepting one-liner:

s.metrics.sessionErrors.WithLabelValues(magicTypeLabel, "yes", "motd").Add(1)
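
For context, that call implies a counter vector with three labels; a hypothetical declaration (namespace, metric, and label names are assumptions, not necessarily what this PR ships) could be:

// registry is the agent's Prometheus registry discussed earlier in the thread.
sessionErrors := prometheus.NewCounterVec(prometheus.CounterOpts{
	Namespace: "agent",
	Subsystem: "sessions",
	Name:      "errors_total",
	Help:      "Errors observed while handling SSH sessions.",
}, []string{"magic_type", "pty", "error_type"}) // matches WithLabelValues(magicTypeLabel, "yes", "motd")
registry.MustRegister(sessionErrors)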

@mtojek mtojek requested review from spikecurtis and johnstcn May 23, 2023 14:49
@mtojek mtojek marked this pull request as ready for review May 23, 2023 14:49
@johnstcn (Member) left a comment

I only have nits, but will defer to others for approval.

@@ -1724,7 +1726,7 @@ func (c closeFunc) Close() error {
return c()
}

func setupAgent(t *testing.T, metadata agentsdk.Manifest, ptyTimeout time.Duration) (
func setupAgent(t *testing.T, metadata agentsdk.Manifest, ptyTimeout time.Duration, opts ...func(agent.Options) agent.Options) (
Member commented:

suggestion: remove ptyTimeout and set a default value in options

Member (Author) commented:

Maybe I will push this in a separate PR. That function is used in ~30 places.

@spikecurtis (Contributor) commented:

I'm not convinced about the error-wrapping technique.

Let me challenge it. There are cases where we'd like to collect more than one error, and trying to aggregate a few of them requires extra logic (at least append operation).

That's fair---the high-level goal is that each legitimately distinct error is counted exactly once. For errors that don't terminate the session, I think it's fine to just directly increment the error counter and move on.

What I want to avoid, though, is double-counting, where an error type is incremented and then a "generic error" count is also incremented after the function returns. We could limit the error wrapping to just those errors that trigger an immediate return in the session handler. The feature I really like about error wrapping is that it ensures we really are counting all the session-ending errors and not forgetting some return paths.

@mtojek mtojek marked this pull request as draft May 25, 2023 06:58
@mtojek (Member, Author) commented May 25, 2023

I see your point, but considering that we have to increment the error counter in two different ways, the argument does not convince me.

What I want to avoid though, is double-counting, where an error type is incremented, and then a "generic error" count is also incremented after the function returns.

It is a theoretical use case, as most crucial code paths fail fast and return (connection is terminated). There is a low chance that the same counter is incremented twice.

The feature I really like about error wrapping is that it affords us the ability to ensure that we really are counting all the session-ending errors and not just forgetting in some return paths.

This is a valid point, but in my opinion, it complicates the code logic too much - I suppose that this is a matter of personal preference.

I'm wondering if we can push this change in the follow-up iteration to deliver the feature first. What you're suggesting here is pure refactoring. Nevertheless, I will start doing it now 👍 .

@spikecurtis (Contributor) commented:

It is a theoretical use case, as most crucial code paths fail fast and return (connection is terminated). There is a low chance that the same counter is incremented twice.

I don't mean as some race condition --- I just mean counting an error at two different places in the stack. Your earlier PR incremented metrics both in sessionStart and then also in sessionHandler if it returned an error. Doesn't look like this PR does this any more, so it's fine not to implement error wrapping if you feel strongly about it.

Name string `json:"name" validate:"required"`
Type AgentMetricType `json:"type" validate:"required" enums:"counter,gauge"`
Value float64 `json:"value" validate:"required"`
Labels []AgentMetricLabel `json:"labels,omitempty"`
Contributor commented:

labels are key-value pairs --- why not just use a map[string]string?

@mtojek (Member, Author) commented May 25, 2023:

I prefer using slices over maps to preserve the order of elements in transit. I didn't want to have to sort them on the receiver side.
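
An illustrative shape for such a label entry (only the Labels []AgentMetricLabel field above is from the diff; the field names here are a guess):

// AgentMetricLabel keeps labels as an ordered list rather than a map so the
// order survives JSON encoding and decoding. Field names are illustrative.
type AgentMetricLabel struct {
	Name  string `json:"name" validate:"required"`
	Value string `json:"value"`
}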

@mtojek mtojek marked this pull request as ready for review May 25, 2023 10:08
@mtojek mtojek requested a review from spikecurtis May 25, 2023 10:09
@spikecurtis (Contributor) commented:

LGTM!

@mtojek mtojek merged commit 14efdad into coder:main May 25, 2023
@github-actions github-actions bot locked and limited conversation to collaborators May 25, 2023