Skip to content

feat: show health check warnings #10620

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 13 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
13 changes: 13 additions & 0 deletions coderd/apidoc/docs.go

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

13 changes: 13 additions & 0 deletions coderd/apidoc/swagger.json

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

18 changes: 14 additions & 4 deletions coderd/healthcheck/derphealth/derp.go
Original file line number Diff line number Diff line change
Expand Up @@ -27,6 +27,7 @@ import (
// @typescript-generate Report
type Report struct {
Healthy bool `json:"healthy"`
Warning bool `json:"warning"`

Regions map[int]*RegionReport `json:"regions"`

Expand All @@ -41,6 +42,7 @@ type Report struct {
type RegionReport struct {
mu sync.Mutex
Healthy bool `json:"healthy"`
Warning bool `json:"warning"`

Region *tailcfg.DERPRegion `json:"region"`
NodeReports []*NodeReport `json:"node_reports"`
Expand All @@ -53,6 +55,7 @@ type NodeReport struct {
clientCounter int

Healthy bool `json:"healthy"`
Warning bool `json:"warning"`
Node *tailcfg.DERPNode `json:"node"`

ServerInfo derp.ServerInfoMessage `json:"node_info"`
Expand Down Expand Up @@ -108,6 +111,9 @@ func (r *Report) Run(ctx context.Context, opts *ReportOptions) {
if !regionReport.Healthy {
r.Healthy = false
}
if regionReport.Warning {
r.Warning = true
}
mu.Unlock()
}()
}
Expand Down Expand Up @@ -159,6 +165,9 @@ func (r *RegionReport) Run(ctx context.Context) {
if !nodeReport.Healthy {
r.Healthy = false
}
if nodeReport.Warning {
r.Warning = true
}
r.mu.Unlock()
}()
}
Expand Down Expand Up @@ -208,14 +217,15 @@ func (r *NodeReport) Run(ctx context.Context) {

// We can't exchange messages with the node,
if (!r.CanExchangeMessages && !r.Node.STUNOnly) ||
// A node may use websockets because `Upgrade: DERP` may be blocked on
// the load balancer. This is unhealthy because websockets are slower
// than the regular DERP protocol.
r.UsesWebsocket ||
// The node was marked as STUN compatible but the STUN test failed.
r.STUN.Error != nil {
r.Healthy = false
}

// A node may use websockets because `Upgrade: DERP` may be blocked on
// the load balancer. This is unhealthy because websockets are slower
// than the regular DERP protocol.
r.Warning = r.UsesWebsocket
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the most important logical change. Using WebSockets won't be considered to be a failure anymore.

}

func (r *NodeReport) doExchangeMessage(ctx context.Context) {
Expand Down
2 changes: 1 addition & 1 deletion coderd/healthcheck/derphealth/derp_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -174,7 +174,7 @@ func TestDERP(t *testing.T) {
for _, region := range report.Regions {
assert.False(t, region.Healthy)
for _, node := range region.NodeReports {
assert.False(t, node.Healthy)
assert.True(t, node.Healthy)
assert.True(t, node.CanExchangeMessages)
assert.NotEmpty(t, node.RoundTripPing)
assert.Len(t, node.ClientLogs, 2)
Expand Down
5 changes: 5 additions & 0 deletions coderd/healthcheck/healthcheck.go
Original file line number Diff line number Diff line change
Expand Up @@ -31,6 +31,8 @@ type Report struct {
Time time.Time `json:"time"`
// Healthy is true if the report returns no errors.
Healthy bool `json:"healthy"`
// Warning is true when Coder is operational but its performance might be impacted.
Warning bool `json:"warning"`
Comment on lines 32 to +35
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we get a report that has both { Healthy: false, Warning: true }?
As far as I understand, that's an invalid state in our current model, which should not be representible.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, but I'm afraid that it will not be achievable without breaking API? I introduced the "warning" field to be perceived as "it is healthy, but...".

To solve it properly, most likely we need to replace "Healthy" with enum, for instance: status: healthy, status: degraded, status: error, etc.

I'm happy to chat about this as it is the right moment.

Copy link
Member

@johnstcn johnstcn Nov 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The more I think about it, it could be useful to have both health and warn information, as they are not necessarily completely separate.

For example, if you're using embedded database and access URL just points straight to the VM, and you get a failed healthcheck for database taking too long and warnings for access URL taking too long to respond, that can already give you an indication that perhaps the server is overloaded.

Maybe we could add

Warnings []string or map[string]string

at the top level, and any warnings from component reports get added in?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with this concept if only we present these warnings on the "Health" page 👍 Otherwise, it is pointless. It also means that the Health page will become more informative than just Healthy/Unhealthy + JSON dump.

// FailingSections is a list of sections that have failed their healthcheck.
FailingSections []string `json:"failing_sections"`

Expand Down Expand Up @@ -139,6 +141,9 @@ func Run(ctx context.Context, opts *ReportOptions) *Report {
if !report.DERP.Healthy {
report.FailingSections = append(report.FailingSections, SectionDERP)
}
if report.DERP.Warning {
report.Warning = true
}
if !report.AccessURL.Healthy {
report.FailingSections = append(report.FailingSections, SectionAccessURL)
}
Expand Down
23 changes: 23 additions & 0 deletions coderd/healthcheck/healthcheck_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -40,6 +40,7 @@ func TestHealthcheck(t *testing.T) {
name string
checker *testChecker
healthy bool
warning bool
failingSections []string
}{{
name: "OK",
Expand Down Expand Up @@ -77,6 +78,26 @@ func TestHealthcheck(t *testing.T) {
},
healthy: false,
failingSections: []string{healthcheck.SectionDERP},
}, {
name: "DERPWarning",
checker: &testChecker{
DERPReport: derphealth.Report{
Healthy: true,
Warning: true,
},
AccessURLReport: healthcheck.AccessURLReport{
Healthy: true,
},
WebsocketReport: healthcheck.WebsocketReport{
Healthy: true,
},
DatabaseReport: healthcheck.DatabaseReport{
Healthy: true,
},
},
healthy: true,
warning: true,
failingSections: nil,
}, {
name: "AccessURLFail",
checker: &testChecker{
Expand Down Expand Up @@ -151,8 +172,10 @@ func TestHealthcheck(t *testing.T) {
})

assert.Equal(t, c.healthy, report.Healthy)
assert.Equal(t, c.warning, report.Warning)
assert.Equal(t, c.failingSections, report.FailingSections)
assert.Equal(t, c.checker.DERPReport.Healthy, report.DERP.Healthy)
assert.Equal(t, c.checker.DERPReport.Warning, report.DERP.Warning)
assert.Equal(t, c.checker.AccessURLReport.Healthy, report.AccessURL.Healthy)
assert.Equal(t, c.checker.WebsocketReport.Healthy, report.Websocket.Healthy)
assert.NotZero(t, report.Time)
Expand Down
16 changes: 11 additions & 5 deletions docs/api/debug.md

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

Loading