Skip to content

feat: show health check warnings #10620

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 13 commits into from
Closed

feat: show health check warnings #10620

wants to merge 13 commits into from

Conversation

mtojek
Copy link
Member

@mtojek mtojek commented Nov 10, 2023

Fixes: #9470

This PR extends the /api/v2/debug/health API response to include warning bool field. Coder will not report "unhealthy" status if there are troubles with Websockets.

Screenshot 2023-11-10 at 14 03 57

@mtojek mtojek self-assigned this Nov 10, 2023
@mtojek mtojek changed the title feat: show health warnings feat: show health check warnings Nov 10, 2023
},
database: {
healthy: true,
reachable: true,
latency: 92570,
error: null,
latency: "92570",
Copy link
Member

@johnstcn johnstcn Nov 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we get rid of this field? (not in this PR)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many places in this API response are discussable. I second the question.

Comment on lines 37 to +40
// Healthy is true if the report returns no errors.
Healthy bool `json:"healthy"`
// Warning is true when Coder is operational but its performance might be impacted.
Warning bool `json:"warning"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we get a report that has both { Healthy: false, Warning: true }?
As far as I understand, that's an invalid state in our current model, which should not be representible.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, but I'm afraid that it will not be achievable without breaking API? I introduced the "warning" field to be perceived as "it is healthy, but...".

To solve it properly, most likely we need to replace "Healthy" with enum, for instance: status: healthy, status: degraded, status: error, etc.

I'm happy to chat about this as it is the right moment.

Copy link
Member

@johnstcn johnstcn Nov 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The more I think about it, it could be useful to have both health and warn information, as they are not necessarily completely separate.

For example, if you're using embedded database and access URL just points straight to the VM, and you get a failed healthcheck for database taking too long and warnings for access URL taking too long to respond, that can already give you an indication that perhaps the server is overloaded.

Maybe we could add

Warnings []string or map[string]string

at the top level, and any warnings from component reports get added in?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with this concept if only we present these warnings on the "Health" page 👍 Otherwise, it is pointless. It also means that the Health page will become more informative than just Healthy/Unhealthy + JSON dump.

Copy link
Member Author

@mtojek mtojek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR touches few important aspects we may want to discuss:

  1. Warning bool vs. Warning []string (Cian) vs "status enum". I don't have any preference, for Warning bool seems to be sufficient, but I don't know where we would like to go with the design. I wouldn't like to drive a new RFC just because of that.
  2. Can we move Healthcheck API response to codersdk?
  3. So, can "site" use API responses from the codersdk? Right now, it uses its own custom implementation. Unfortunately, it seems to be incorrect/not fully implemented.

I will ping a couple of folks so that we have an opinionated design.

@@ -2456,7 +2459,8 @@ export const MockHealth = {
TokenBucketBytesBurst: 0,
},
can_exchange_messages: true,
round_trip_ping: 7674330,
round_trip_ping: "7674330",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question as Cian posted for "latency". Is it backward compatibility?

@@ -2467,18 +2471,16 @@ export const MockHealth = {
],
],
client_errs: [[], []],
error: null,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does UI/CLI/something use "error" fields?

// A node may use websockets because `Upgrade: DERP` may be blocked on
// the load balancer. This is unhealthy because websockets are slower
// than the regular DERP protocol.
r.Warning = r.UsesWebsocket
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the most important logical change. Using WebSockets won't be considered to be a failure anymore.

export const getHealth = async () => {
const response = await axios.get<Health>("/api/v2/debug/health");
const response = await axios.get<TypesGen.HealthcheckReport>(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BrunoQuaresma Do you know why it uses Health instead of TypesGen.HealthcheckReport? Is there any strong reason why it does not use api/api?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extracted it here: #10650

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't remember but this is a good improvement for sure.

export const getHealth = async () => {
const response = await axios.get<Health>("/api/v2/debug/health");
const response = await axios.get<TypesGen.HealthcheckReport>(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BrunoQuaresma It seems that make gen does not generate DerphealthcheckReport even though it is marked to be generated. Do you have any idea why it happens?

@@ -2097,7 +2097,9 @@ export interface HealthcheckReport {
   readonly healthy: boolean;
   readonly warning: boolean;
   readonly failing_sections: string[];
-  readonly derp: DerphealthReport;
+  // Named type "github.com/coder/coder/v2/coderd/healthcheck/derphealth.Report" unknown, using "any"
+  // eslint-disable-next-line @typescript-eslint/no-explicit-any -- External type
+  readonly derp: any;
   readonly access_url: HealthcheckAccessURLReport;
   readonly websocket: HealthcheckWebsocketReport;
   readonly database: HealthcheckDatabaseReport

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extracted it here: #10650

Copy link
Collaborator

@BrunoQuaresma BrunoQuaresma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the FE perspective, it looks good

@mtojek
Copy link
Member Author

mtojek commented Nov 15, 2023

I'm going to redo this according to the redesign notion doc.

@mtojek mtojek closed this Nov 15, 2023
@github-actions github-actions bot locked and limited conversation to collaborators Nov 15, 2023
@github-actions github-actions bot deleted the 9470-fallback-is-healthy branch May 16, 2024 00:04
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

health check endpoint: do not show healthy: false if Coder falls back to websockets
3 participants