Skip to content

feat: Collect agent SSH metrics #7584

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 42 commits into from
May 25, 2023
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
Show all changes
42 commits
Select commit Hold shift + click to select a range
092e1d0
feat: Collect agent SSH metrics
mtojek May 17, 2023
0c0df91
more metrics
mtojek May 17, 2023
84b59ea
err
mtojek May 17, 2023
000586a
session metrics
mtojek May 18, 2023
11fb056
session error
mtojek May 18, 2023
07b2b0e
fix
mtojek May 18, 2023
e38a7be
fmt
mtojek May 18, 2023
ba4bb4d
WIP
mtojek May 18, 2023
0d0f300
Refactored to client_golang/prometheus
mtojek May 18, 2023
315b5ce
fix
mtojek May 18, 2023
43d5d40
fix
mtojek May 18, 2023
85f8860
refactor
mtojek May 18, 2023
7b26267
Merge branch 'main' into 6724-ssh-metrics
mtojek May 18, 2023
34f07fc
refactor
mtojek May 18, 2023
59fd585
fix test
mtojek May 18, 2023
8cd927c
fix
mtojek May 18, 2023
6eec4d7
fix
mtojek May 18, 2023
a059edf
fix
mtojek May 18, 2023
c004c04
fix
mtojek May 18, 2023
9b0e31a
Address PR comments
mtojek May 19, 2023
7d4ccce
x11HostnameError
mtojek May 19, 2023
90b351d
Remove callbacks
mtojek May 19, 2023
1773c24
failedConnectionsTotal
mtojek May 19, 2023
5ac27b7
connectionsTotal
mtojek May 19, 2023
6eb1a95
sftpConnectionsTotal
mtojek May 19, 2023
e3d7493
sessionError
mtojek May 19, 2023
9620452
sftpServerErrors
mtojek May 19, 2023
f05466c
remove handlerError
mtojek May 19, 2023
5887ee8
WIP
mtojek May 19, 2023
bb3602b
WIP
mtojek May 22, 2023
27fc9a0
WIP
mtojek May 22, 2023
3f4696b
Finish impl
mtojek May 23, 2023
8cd07f2
Aggregator: labels
mtojek May 23, 2023
a51cde9
Merge branch 'main' into 6724-ssh-metrics
mtojek May 23, 2023
389dd9f
TestAgent_Metrics_SSH
mtojek May 23, 2023
8e10d6d
Address PR comments
mtojek May 24, 2023
1858dc2
use labelIndex
mtojek May 24, 2023
8dde9f9
Merge branch 'main' into 6724-ssh-metrics
mtojek May 24, 2023
db725a3
Merge branch 'main' into 6724-ssh-metrics
mtojek May 25, 2023
f416287
PR comments part 1
mtojek May 25, 2023
4daf37d
PR comments part 2
mtojek May 25, 2023
d9203b8
PR comments part 3
mtojek May 25, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Prev Previous commit
Next Next commit
session error
  • Loading branch information
mtojek committed May 18, 2023
commit 11fb0568ae71bcc0f922d132cb3f2cb547a3270e
21 changes: 16 additions & 5 deletions agent/agentssh/agentssh.go
Original file line number Diff line number Diff line change
Expand Up @@ -198,25 +198,37 @@ func (s *Server) sessionHandler(session ssh.Session) {
return
}

err := s.sessionStart(session, extraEnv)
m := metricsForSession(magicType(session))
err := s.sessionStart(session, m, extraEnv)
var exitError *exec.ExitError
if xerrors.As(err, &exitError) {
s.logger.Debug(ctx, "ssh session returned", slog.Error(exitError))
s.logger.Warn(ctx, "ssh session returned", slog.Error(exitError))
m.sessionError.Add(1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This construction where we increment error metrics both here and down the stack where they occur means that we count the same error in more than one metric. This makes sums over the error metrics very misleading.

A better design is to only increment the metrics here in this function by selecting on the type of error returned by s.sessionStart.

You could create an error wrapper e.g.

type metricError struct {
    err error
    labels prometheus.Labels
} 

func (m metricError) Error() string {
    return m.Unwrap().Error()
}

func (m metricError) Unwrap() error {
    return m.err
}

Then use xerrrors.As to check for this type and increment accordingly. If we forget to wrap some errors, we can have an "unknown" error type that we increment in this case, which could signal us to go back and wrap the missing error.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This construction where we increment error metrics both here and down the stack where they occur means that we count the same error in more than one metric.

I shall disagree, as having a generic metric to indicate that a session error happened is much easier to set up alarming. You don't need to watch every single metric and eventually miss newly added ones, but just have one to depend on.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In prometheus it's easy to query for the sum across the label of error metrics. No chance of missing one if more label values are added later.

_ = session.Exit(exitError.ExitCode())
return
}
if err != nil {
s.logger.Warn(ctx, "ssh session failed", slog.Error(err))
// This exit code is designed to be unlikely to be confused for a legit exit code
// from the process.
metricSessionError.Add(1)
m.sessionError.Add(1)
_ = session.Exit(MagicSessionErrorCode)
return
}
_ = session.Exit(0)
}

func (s *Server) sessionStart(session ssh.Session, extraEnv []string) (retErr error) {
func magicType(session ssh.Session) string {
for _, kv := range session.Environ() {
if !strings.HasPrefix(kv, MagicSessionTypeEnvironmentVariable) {
continue
}
return strings.TrimPrefix(kv, MagicSessionTypeEnvironmentVariable+"=")
}
return ""
}

func (s *Server) sessionStart(session ssh.Session, m sessionMetricsObject, extraEnv []string) (retErr error) {
ctx := session.Context()
env := append(session.Environ(), extraEnv...)
var magicType string
Expand All @@ -241,7 +253,6 @@ func (s *Server) sessionStart(session ssh.Session, extraEnv []string) (retErr er
s.logger.Warn(ctx, "invalid magic ssh session type specified", slog.F("type", magicType))
}

m := metricsForSession(magicType)
cmd, err := s.CreateCommand(ctx, session.RawCommand(), env)
if err != nil {
m.agentCreateCommandError.Add(1)
Expand Down
4 changes: 2 additions & 2 deletions agent/agentssh/metrics.go
Original file line number Diff line number Diff line change
Expand Up @@ -14,8 +14,6 @@ var (
metricReversePortForwardingCallback = clientmetric.NewCounter("ssh_reverse_port_forwarding_callback")
metricX11Callback = clientmetric.NewCounter("ssh_x11_callback")

metricSessionError = clientmetric.NewCounter("ssh_session_error")

// PTY sessions

// SFTP
Expand All @@ -37,6 +35,7 @@ type sessionMetricsObject struct {
agentListenerError *clientmetric.Metric
startPTYSession *clientmetric.Metric
startNonPTYSession *clientmetric.Metric
sessionError *clientmetric.Metric

// Non-PTY sessions
nonPTYStdinPipeError *clientmetric.Metric
Expand All @@ -60,6 +59,7 @@ func init() {
agentListenerError: clientmetric.NewCounter(fmt.Sprintf("ssh_agent_%s_listener_error", magicType)),
startPTYSession: clientmetric.NewCounter(fmt.Sprintf("ssh_agent_%s_start_pty_session", magicType)),
startNonPTYSession: clientmetric.NewCounter(fmt.Sprintf("ssh_agent_%s_start_non_pty_session", magicType)),
sessionError: clientmetric.NewCounter(fmt.Sprintf("ssh_agent_%s_session_error", magicType)),

nonPTYStdinPipeError: clientmetric.NewCounter(fmt.Sprintf("ssh_server_%s_non_pty_stdin_pipe_error", magicType)),
nonPTYStdinIoCopyError: clientmetric.NewCounter(fmt.Sprintf("ssh_server_%s_non_pty_stdin_io_copy_error", magicType)),
Expand Down