feat: Collect agent SSH metrics #7584

Merged: 42 commits into coder:main (May 25, 2023)

Conversation

@mtojek (Member) commented May 17, 2023

Related: #6724

This PR adds more metrics to the SSH agent. These should help with identifying connectivity issues.

It is similar to the metrics support in tailssh; I was challenged by Spike to use metric labels.

Sample metrics:

agent_sessions_total{agent_name="main",magic_type="ssh",pty="yes",username="admin",workspace_name="docker-1"} 5
agent_sessions_total{agent_name="main",magic_type="vscode",pty="no",username="admin",workspace_name="docker-1"} 1

@mtojek mtojek marked this pull request as ready for review May 18, 2023 11:16
@mtojek mtojek requested a review from johnstcn May 18, 2023 11:17
@mtojek mtojek requested a review from spikecurtis May 18, 2023 11:20
@spikecurtis (Contributor) left a comment

I'm wary of using global state for this.

We originally did it this way with promauto in V1 and ripped it out later.

https://github.com/coder/v1/issues/13357

Global state is basically impossible to test and can create unintended side effects just by importing packages.

I realize we're already a bit wed to tailscale.com/util/clientmetric in order to get metrics out of tailscale, but I don't think we should add to the problem.

As an alternative, we could attach a prometheus registry to the agent and register the metrics there, and then collect them to ship to coderd.
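
A minimal sketch of that registry approach, with hypothetical names (agentMetrics, newAgentMetrics) rather than the actual implementation in this PR:

package agent

import (
	"github.com/prometheus/client_golang/prometheus"
	dto "github.com/prometheus/client_model/go"
)

// agentMetrics holds a per-agent registry instead of the global default one.
type agentMetrics struct {
	registry      *prometheus.Registry
	sessionsTotal *prometheus.CounterVec
}

func newAgentMetrics() *agentMetrics {
	reg := prometheus.NewRegistry()
	sessions := prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "agent_sessions_total",
		Help: "Total number of SSH sessions handled by the agent.",
	}, []string{"magic_type", "pty"})
	reg.MustRegister(sessions)
	return &agentMetrics{registry: reg, sessionsTotal: sessions}
}

// gather snapshots the registered metrics so they can be attached to the
// stats payload shipped to coderd.
func (m *agentMetrics) gather() ([]*dto.MetricFamily, error) {
	return m.registry.Gather()
}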

@mtojek (Member, Author) commented May 18, 2023

I realize we're already a bit wed to tailscale.com/util/clientmetric in order to get metrics out of tailscale, but I don't think we should add to the problem.

I admit that it was the reason why I decided to expose agent metrics this way. There is a function called collectMetrics(), defined in metrics.go, that reviews tailscale clientmetrics and attaches them to the Stats object.

I don't mind enabling a custom Prometheus registry in the agent to collect these metrics. It just requires more implementation on our side, which is what convinced me to stick with tailscale.com/util/clientmetric.

I will try to refactor this to use the registry 👍 .

@mtojek (Member, Author) commented May 18, 2023

I refactored the source to use PrometheusRegistry. It resulted in implementing a custom handler to merge Tailscale and agent metrics. Let me know your thoughts, and in the meantime, I will proceed with testing.

Unfortunately, the size of the PR jumped from ~160 to ~450 LOC. I guess I can limit it by providing a wrapper around prometheus.NewCounter + registerer.MustRegister.
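
Such a wrapper could be as small as the sketch below (the helper name is hypothetical); the promauto.With(registry) factory in the Prometheus client library offers essentially the same convenience without touching the global default registry:

// registerCounterVec is a hypothetical promauto-style helper: it creates a
// counter vector and registers it on the given (non-global) registerer in
// one call.
func registerCounterVec(reg prometheus.Registerer, opts prometheus.CounterOpts, labelNames ...string) *prometheus.CounterVec {
	cv := prometheus.NewCounterVec(opts, labelNames)
	reg.MustRegister(cv)
	return cv
}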

var exitError *exec.ExitError
if xerrors.As(err, &exitError) {
s.logger.Debug(ctx, "ssh session returned", slog.Error(exitError))
s.logger.Warn(ctx, "ssh session returned", slog.Error(exitError))
m.sessionError.Add(1)
Contributor commented:

This construction where we increment error metrics both here and down the stack where they occur means that we count the same error in more than one metric. This makes sums over the error metrics very misleading.

A better design is to only increment the metrics here in this function by selecting on the type of error returned by s.sessionStart.

You could create an error wrapper e.g.

// metricError wraps an error together with the Prometheus labels that
// should be incremented when it is counted.
type metricError struct {
	err    error
	labels prometheus.Labels
}

func (m metricError) Error() string {
	return m.Unwrap().Error()
}

func (m metricError) Unwrap() error {
	return m.err
}

Then use xerrors.As to check for this type and increment accordingly. If we forget to wrap some errors, we can have an "unknown" error type that we increment in that case, which would signal us to go back and wrap the missing error.
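
To illustrate, consuming that wrapper could look roughly like this; the function name is made up, and sessionErrors is assumed (for simplicity) to be a CounterVec keyed by a single "error_type" label:

// incrementSessionError counts a session-ending error exactly once. Errors
// wrapped as metricError carry their own labels; anything else falls into
// an "unknown" bucket that signals a missing wrap.
func incrementSessionError(sessionErrors *prometheus.CounterVec, err error) {
	var me metricError
	if xerrors.As(err, &me) {
		sessionErrors.With(me.labels).Inc()
		return
	}
	sessionErrors.With(prometheus.Labels{"error_type": "unknown"}).Inc()
}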

Member (Author) commented:

This construction where we increment error metrics both here and down the stack where they occur means that we count the same error in more than one metric.

I have to disagree: having a generic metric indicating that a session error happened makes it much easier to set up alerting. You don't need to watch every single metric (and eventually miss newly added ones); you can just depend on one.

Contributor commented:

In Prometheus it's easy to query for the sum across the labels of an error metric, so there's no chance of missing one if more label values are added later.

_, _ = io.Copy(ptty.InputWriter(), session)
_, err := io.Copy(ptty.InputWriter(), session)
if err != nil {
m.ptyInputIoCopyError.Add(1)
Contributor commented:

Note that my earlier comment about wrapping errors and incrementing the metrics higher in the stack does not apply to these errors, which occur in separate goroutines---there is nothing higher in the stack to increment the metrics, and in a real sense each is a distinct error from any error hit on the main goroutine, so we're not double-counting by incrementing here.

@mtojek (Member, Author) commented May 23, 2023

@spikecurtis I refactored the source code to use metric labels. Looking at the code now (although it is still a draft), I'm not convinced about the error-wrapping technique.

Let me challenge it. There are cases where we'd like to collect more than one error, and trying to aggregate a few of them requires extra logic (at least an append operation).

Consider the following code:

	if !isQuietLogin(session.RawCommand()) {
		manifest := s.Manifest.Load()
		if manifest != nil {
			err := showMOTD(session, manifest.MOTDFile)
			if err != nil {
				s.logger.Error(ctx, "show MOTD", slog.Error(err))
				s.metrics.sessionErrors.WithLabelValues(magicTypeLabel, "yes", "motd").Add(1)
			}
		} else {
			s.logger.Warn(ctx, "metadata lookup failed, unable to show MOTD")
		}
	}

We'd like to know that something went wrong with MOTD, but it doesn't mean that the session gets terminated instantly. If something bad happens to the session later, the code would have to bubble up an aggregate of errors. Personally, I don't see a big gain here compared to a simple intercepting one-liner:

s.metrics.sessionErrors.WithLabelValues(magicTypeLabel, "yes", "motd").Add(1)
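
For context, that call implies a counter vector with three labels; a hypothetical declaration (namespace, metric, and label names are assumptions, not necessarily what this PR ships) could be:

// registry is the agent's Prometheus registry discussed earlier in the thread.
sessionErrors := prometheus.NewCounterVec(prometheus.CounterOpts{
	Namespace: "agent",
	Subsystem: "sessions",
	Name:      "errors_total",
	Help:      "Errors observed while handling SSH sessions.",
}, []string{"magic_type", "pty", "error_type"}) // matches WithLabelValues(magicTypeLabel, "yes", "motd")
registry.MustRegister(sessionErrors)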

@mtojek mtojek requested review from spikecurtis and johnstcn May 23, 2023 14:49
@mtojek mtojek marked this pull request as ready for review May 23, 2023 14:49
@johnstcn (Member) left a comment

I only have nits, but will defer to others for approval.

@@ -1724,7 +1726,7 @@ func (c closeFunc) Close() error {
return c()
}

func setupAgent(t *testing.T, metadata agentsdk.Manifest, ptyTimeout time.Duration) (
func setupAgent(t *testing.T, metadata agentsdk.Manifest, ptyTimeout time.Duration, opts ...func(agent.Options) agent.Options) (
Member commented:

suggestion: remove ptyTimeout and set a default value in options

Member (Author) commented:

Maybe I will push this in a separate PR. That function is used in ~30 places.

@spikecurtis (Contributor) commented:

I'm not convinced about the error-wrapping technique.

Let me challenge it. There are cases where we'd like to collect more than one error, and trying to aggregate a few of them requires extra logic (at least append operation).

That's fair---the high-level goal is that each legitimately distinct error is counted exactly once. For errors that don't terminate the session, I think it's fine to just directly increment the error counter and move on.

What I want to avoid, though, is double-counting, where an error type is incremented and then a "generic error" count is also incremented after the function returns. We could limit the error wrapping to just those errors that trigger an immediate return in the session handler. The feature I really like about error wrapping is that it ensures we really are counting all the session-ending errors and not forgetting some return paths.

@mtojek mtojek marked this pull request as draft May 25, 2023 06:58
@mtojek (Member, Author) commented May 25, 2023

I see your point, but considering that we have to increment the error counter in two different ways, the argument does not convince me.

What I want to avoid though, is double-counting, where an error type is incremented, and then a "generic error" count is also incremented after the function returns.

It is a theoretical use case, as most crucial code paths fail fast and return (connection is terminated). There is a low chance that the same counter is incremented twice.

The feature I really like about error wrapping is that it affords us the ability to ensure that we really are counting all the session-ending errors and not just forgetting in some return paths.

This is a valid point, but in my opinion, it complicates the code logic too much - I suppose that this is a matter of personal preference.

I'm wondering if we can push this change in the follow-up iteration to deliver the feature first. What you're suggesting here is pure refactoring. Nevertheless, I will start doing it now 👍 .

@spikecurtis (Contributor) commented:

It is a theoretical use case, as most crucial code paths fail fast and return (connection is terminated). There is a low chance that the same counter is incremented twice.

I don't mean as some race condition --- I just mean counting an error at two different places in the stack. Your earlier PR incremented metrics both in sessionStart and then also in sessionHandler if it returned an error. Doesn't look like this PR does this any more, so it's fine not to implement error wrapping if you feel strongly about it.

Name string `json:"name" validate:"required"`
Type AgentMetricType `json:"type" validate:"required" enums:"counter,gauge"`
Value float64 `json:"value" validate:"required"`
Labels []AgentMetricLabel `json:"labels,omitempty"`
Contributor commented:

labels are key-value pairs --- why not just use a map[string]string?

@mtojek (Member, Author) commented May 25, 2023:

I prefer using slices over maps to preserve the order of elements in transit. I didn't want to have to sort them on the receiver side.
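
An illustrative shape for such a label entry (only the Labels []AgentMetricLabel field above is from the diff; the field names here are a guess):

// AgentMetricLabel keeps labels as an ordered list rather than a map so the
// order survives JSON encoding and decoding. Field names are illustrative.
type AgentMetricLabel struct {
	Name  string `json:"name" validate:"required"`
	Value string `json:"value"`
}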

@mtojek mtojek marked this pull request as ready for review May 25, 2023 10:08
@mtojek mtojek requested a review from spikecurtis May 25, 2023 10:09
@spikecurtis (Contributor) commented:

LGTM!

@mtojek mtojek merged commit 14efdad into coder:main May 25, 2023
@github-actions github-actions bot locked and limited conversation to collaborators May 25, 2023