
feat(coderd): add user latency and template insights endpoints #8519

Merged

Conversation

@mafredri (Member) commented Jul 14, 2023

This pull request implements the user latency and template insights endpoints, as per #8514.

Data available:

  • User latency for the specified time range and templates
  • Template users, both for the whole interval and daily
  • Template built-in app usage (vscode, jetbrains, web terminal, and ssh) for the whole interval

Missing (left for follow-up PRs):

  • Template parameter value/usage
  • Template custom app usage

Updates #8514
Refs #8109

@mafredri mafredri linked an issue Jul 14, 2023 that may be closed by this pull request
@mafredri mafredri force-pushed the mafredri/feat-coderd-add-user-latency-and-insights-endpoints branch from 8458e66 to c728857 Compare July 14, 2023 15:59
@mafredri mafredri force-pushed the mafredri/feat-coderd-add-user-latency-and-insights-endpoints branch from c728857 to aaadc6a Compare July 19, 2023 13:14
@mafredri (Member Author) commented Jul 19, 2023

Status update: the base logic is mostly done; however, some timeframe edge cases are not yet handled and there are a few open questions. Tests also need to be added.

  • We currently accept input times with date+clock; we should perhaps either remove the clock component or enforce that it's 00:00:00.
  • Allowing a clock component skews the interval report since we're not locked to whole days

Questions:

  • What timezone should we use? We could use the deployment, user, or viewer timezone
    • For simplicity/flexibility, we can allow the dashboard to request whatever timezone it feels is best (as it is now)
    • For the most accurate data (IMO), we'd use the user timezone so that user A's Monday and user B's Monday both appear as Monday in the data, even though they're on opposite sides of the globe
  • App usage: should this be cumulative or an average across all users? (The current implementation is cumulative/per user)

Issues:

  • We're relying heavily on workspace_agent_stats; however, user latency based on this table is skewed since it includes agent<->coderd latency
    • Also, if an admin e.g. does ssh otheruser/workspace, the admin's session will be counted towards user stats as well (both latency and app usage); that said, fixing this is not a high priority

Current response samples:

GET http://localhost:3000/api/v2/insights/user-latency?start_time=2023-07-18T00:00:00Z&end_time=2023-07-20T00:00:00Z
{
  "report": {
    "start_time": "2023-07-18T00:00:00Z",
    "end_time": "2023-07-20T00:00:00Z",
    "template_ids": [
      "bb20c5b7-3b3c-4031-868b-3aee8b80e79b"
    ],
    "users": [
      {
        "template_ids": [
          "bb20c5b7-3b3c-4031-868b-3aee8b80e79b"
        ],
        "user_id": "a6d24688-2c59-423a-a8b0-41d56331d4a1",
        "username": "admin",
        "latency_ms": {
          "p50": 0.327,
          "p95": 0.5352
        }
      },
      {
        "template_ids": [],
        "user_id": "e7217d35-9ce1-4c98-ae09-c6c9ad6f5686",
        "username": "member",
        "latency_ms": null
      },
      {
        "template_ids": [
          "bb20c5b7-3b3c-4031-868b-3aee8b80e79b"
        ],
        "user_id": "d9fb2aa8-fedb-4c76-a759-27e4e4c5d550",
        "username": "woot",
        "latency_ms": {
          "p50": 0.59,
          "p95": 0.59
        }
      }
    ]
  }
}
GET http://localhost:3000/api/v2/insights/templates?start_time=2023-07-18T00:00:00Z&end_time=2023-07-20T00:00:00Z&interval=day
{
  "report": {
    "start_time": "2023-07-18T00:00:00Z",
    "end_time": "2023-07-20T00:00:00Z",
    "template_ids": [
      "bb20c5b7-3b3c-4031-868b-3aee8b80e79b"
    ],
    "active_users": 2,
    "apps_usage": [
      {
        "template_ids": [
          "bb20c5b7-3b3c-4031-868b-3aee8b80e79b"
        ],
        "type": "builtin",
        "display_name": "Visual Studio Code",
        "slug": "vscode",
        "icon": "/icons/code.svg",
        "seconds": 0
      },
      {
        "template_ids": [
          "bb20c5b7-3b3c-4031-868b-3aee8b80e79b"
        ],
        "type": "builtin",
        "display_name": "JetBrains",
        "slug": "jetbrains",
        "icon": "/icons/intellij.svg",
        "seconds": 0
      },
      {
        "template_ids": [
          "bb20c5b7-3b3c-4031-868b-3aee8b80e79b"
        ],
        "type": "builtin",
        "display_name": "Web Terminal",
        "slug": "reconnecting-pty",
        "icon": "/icons/terminal.svg",
        "seconds": 4800
      },
      {
        "template_ids": [
          "bb20c5b7-3b3c-4031-868b-3aee8b80e79b"
        ],
        "type": "builtin",
        "display_name": "SSH",
        "slug": "ssh",
        "icon": "/icons/terminal.svg",
        "seconds": 1800
      }
    ]
  },
  "interval_reports": [
    {
      "start_time": "2023-07-18T00:00:00Z",
      "end_time": "2023-07-19T00:00:00Z",
      "template_ids": [
        "bb20c5b7-3b3c-4031-868b-3aee8b80e79b"
      ],
      "interval": "day",
      "active_users": 2
    },
    {
      "start_time": "2023-07-19T00:00:00Z",
      "end_time": "2023-07-20T00:00:00Z",
      "template_ids": [
        "bb20c5b7-3b3c-4031-868b-3aee8b80e79b"
      ],
      "interval": "day",
      "active_users": 1
    }
  ]
}
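
For context on the latency_ms fields: p50/p95 are percentiles over per-connection latency samples. Below is a minimal Go sketch of such a percentile calculation, with linear interpolation between the two nearest ranks as in PostgreSQL's PERCENTILE_CONT; it is illustrative only, not the actual query or data.

package main

import (
	"fmt"
	"sort"
)

// percentile returns the p'th percentile (0 <= p <= 1) of samples, with
// linear interpolation between the two nearest ranks, the same idea as
// PostgreSQL's PERCENTILE_CONT.
func percentile(samples []float64, p float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)
	rank := p * float64(len(sorted)-1) // fractional rank into the sorted slice
	lo := int(rank)
	if lo+1 >= len(sorted) {
		return sorted[len(sorted)-1]
	}
	frac := rank - float64(lo)
	return sorted[lo]*(1-frac) + sorted[lo+1]*frac
}

func main() {
	// Hypothetical per-connection latency samples in milliseconds.
	latencies := []float64{0.29, 0.327, 0.412, 0.5352}
	fmt.Printf("p50=%.4f p95=%.4f\n", percentile(latencies, 0.50), percentile(latencies, 0.95))
}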

@mtojek (Member) commented Jul 20, 2023

We currently accept input times with date+clock; we should perhaps either remove the clock component or enforce that it's 00:00:00.
Allowing a clock component skews the interval report since we're not locked to whole days

Thinking in terms of an MVP, this is something we can iterate on later. IMHO we should start with whole days, and exceptionally allow for the current day to be counted in hours (not 24h yet).

For simplicity/flexibility, we can allow the dashboard to request whatever timezone it feels is best (as it is now)

Alternatively, you can depend on the deployment time+TZ and just share it over the Insights API so that the UI can consider it.

For the most accurate data (IMO), we'd use the user timezone so that user A's Monday and user B's Monday both appear as Monday in the data, even though they're on opposite sides of the globe

DevOps people are probably aware of this issue, so they will be fine with any option :) Do you find any of them easier to implement?

App usage: should this be cumulative or an average across all users? (The current implementation is cumulative/per user)

cumulative/per user 👍 BUT... what will the API response look like if we are doing scale tests with 1000 users? AFAIR I asked about this before.

Also, if an admin e.g. does ssh otheruser/workspace, the admin's session will be counted towards user stats as well (both latency and app usage); that said, fixing this is not a high priority

I would park it for later 👍

@mafredri (Member Author)

Thinking in terms of an MVP, this is something we can iterate on later. IMHO we should start with whole days, and exceptionally allow for the current day to be counted in hours (not 24h yet).

Agreed, I'll add the "today" exception so that it's possible for the site to show stable numbers for today.

Alternatively, you can depend on the deployment time+TZ and just share it over the Insights API so that the UI can consider it.

Sure, we can do that. We'll have to remove TZ from the start_time/end_time request format as well.

DevOps people are probably aware of this issue, so they will be fine with any option :) Do you find any of them easier to implement?

Since we don't have knowledge of user TZ, let's go with the simple approach for now. 😄

[...] what will the API response look like if we are doing scale tests with 1000 users? AFAIR I asked about this before.

I guess the scaletest numbers would be included, unless there's some special handling to not gather stats in this case? It's unclear to me whether or not we'd want to filter them out; will scale testing be a common/recurring practice on live deployments?

@mtojek (Member) commented Jul 20, 2023

I guess the scaletest numbers would be included, unless there's some special handling to not gather stats in this case? It's unclear to me whether or not we'd want to filter them out; will scale testing be a common/recurring practice on live deployments?

IMHO we have to look at scaletesting from a different angle. Running load tests with traffic from ~1000 users means that any customer can potentially create such a large deployment. The Coder API must sustain this load too, and in this case I'm mostly concerned about the response size of the Insights endpoint.

@johnstcn (Member)

What timezone should we use? We could use the deployment, user, or viewer timezone

The principle of least surprise would be that the FE specifies the timezone to be used, auto-detected from the browser.

App usage: should this be cumulative or an average across all users? (The current implementation is cumulative/per user)

Cumulative across all users makes the most sense here, I think. Per-user averages are too widely skewed.

I guess the scaletest numbers would be included, unless there's some special handling to not gather stats in this case? It's unclear to me whether or not we'd want to filter them out; will scale testing be a common/recurring practice on live deployments?

The scaletest command is hidden, so we should not expect it to be used regularly in the field. That having been said, all scaletest users and workspaces are prefixed with the string scaletest-, so excluding them from the /insights endpoint should not be too difficult. But maybe we do want to leverage scaletests to generate data for the insights endpoint? It should be performant even with a large amount of data.

// Example:
// - I want data from Monday - Friday
// - I'm UTC+3 and the deployment is UTC+0
// - Do we select Monday - Friday in UTC+0 or UTC+3?
Member

Friday UTC+3, using timezone from the incoming request. Coder admins won't necessarily know or care about the localtime of the server on which the deployment is running.

Member Author

One factor here is that two people looking at "the same data" may see different numbers. They may share screen captures and wonder why the data doesn't add up. An anchor like deployment TZ would help in that case.

Member

Yeah, showing the user's relative difference from the deployment TZ (or even just showing the GMT offset in the graph) would help indicate this.
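
To illustrate the request-timezone option discussed in this thread: time.Parse keeps the UTC offset embedded in an RFC 3339 timestamp, so the backend can check day boundaries in the caller's zone rather than the server's. A minimal sketch (not the actual coderd code):

package main

import (
	"fmt"
	"time"
)

func main() {
	// time.Parse keeps the UTC offset embedded in an RFC 3339 string, so
	// "midnight" is checked in the requester's zone, not the server's.
	start, err := time.Parse(time.RFC3339, "2023-07-17T00:00:00+03:00")
	if err != nil {
		panic(err)
	}
	if h, m, s := start.Clock(); h != 0 || m != 0 || s != 0 {
		fmt.Println("start_time must be midnight in the requested timezone")
		return
	}
	fmt.Println("accepted, midnight in UTC+3:", start)
}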

@mafredri (Member Author)

@mtojek

IMHO we have to look at scaletesting from a different angle. Running load tests with traffic from ~1000 users means that any customer can potentially create such a large deployment. The Coder API must sustain this load too, and in this case I'm mostly concerned about the response size of the Insights endpoint.

Ah right, this mainly affects the user-latency endpoint, which we can add pagination to when/if needed. I think we can punt on pagination for now, though. All stats in the templates endpoint are aggregated, so we don't have the same scale issue there.

@johnstcn

The principle of least surprise would be that the FE specifies the timezone to be used, auto-detected from the browser.

I'd be fine with this; it's easier to handle on the backend too (as is currently the case) 😄. Thoughts, @mtojek?

Cumulative across all users makes the most sense here, I think. Per-user averages are too widely skewed.

Ah, I believe I worded that in too open-ended a way. It's currently cumulative across all users; by per-user I meant grouped by user. This comes into play in e.g. the following scenario:

User A has two workspaces and a VSCode session open to both workspaces from 9am to 10am. User A contributes 1h of VSCode usage to the cumulative stats (vs 2h). The same goes for two VSCode sessions open in one workspace, etc.

It's perhaps a weird line to draw, though, since the user can still contribute 1h of vscode + 1h of jetbrains + 1h of terminal usage. The idea, anyway, is that a user can only focus on one app at a time.
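
To make the grouping concrete, here's a rough Go sketch assuming usage is tracked in per-minute buckets; the usageKey type and its fields are hypothetical, not the actual coderd schema:

package main

import (
	"fmt"
	"time"
)

// usageKey identifies one user's use of one app within one minute bucket.
// The type and field names are hypothetical, not the actual coderd schema.
type usageKey struct {
	userID string
	app    string
	minute time.Time
}

// dedupedUsage counts each (user, app, minute) bucket once, so two
// concurrent VSCode sessions in the same minute contribute 1m, not 2m.
func dedupedUsage(sessions []usageKey) time.Duration {
	seen := make(map[usageKey]struct{})
	for _, s := range sessions {
		s.minute = s.minute.Truncate(time.Minute)
		seen[s] = struct{}{}
	}
	return time.Duration(len(seen)) * time.Minute
}

func main() {
	t := time.Date(2023, 7, 20, 9, 0, 0, 0, time.UTC)
	fmt.Println(dedupedUsage([]usageKey{
		{"userA", "vscode", t},                  // workspace 1
		{"userA", "vscode", t},                  // workspace 2, same minute: deduped
		{"userA", "vscode", t.Add(time.Minute)}, // next minute
	})) // prints 2m0s
}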

But maybe we do want to leverage scaletests to generate data for the insights endpoint? It should be performant even with a large amount of data.

I agree. 👍🏻

@johnstcn (Member)

The idea, anyway, is that a user can only focus on one app at a time.

I wouldn't necessarily make that assumption ;-)

@mafredri (Member Author) commented Jul 20, 2023

I wouldn't necessarily make that assumption ;-)

I think it's a matter of making the data useful. For instance, take a deployment with one user who has 8 workspaces based on one template. At the start of the day they launch VSCode for each workspace and keep hopping between editor windows during the day. Given the current implementation, we will see 8h of VSCode usage (logical: one employee working a full day), whereas without this assumption we'd see 64h of VSCode usage for one day (confusing).

I think this is a question for @BrunoQuaresma and @bpmct, though: which matches their vision better?

@mafredri mafredri force-pushed the mafredri/feat-coderd-add-user-latency-and-insights-endpoints branch from 00a9b6a to 167719c Compare July 20, 2023 16:38
@mafredri mafredri force-pushed the mafredri/feat-coderd-add-user-latency-and-insights-endpoints branch from 167719c to 2e51056 Compare July 20, 2023 16:44
@mafredri (Member Author)

Decisions made:

  • User latency pagination will be left for later (/ added as needed)
  • The client provides the timezone as part of start_time/end_time; it can use the deployment TZ if that's preferred
  • Time ranges can't be in the future, except "today", which is special-cased: if the clock is 15:33:00, an end time of 16:00:00 is OK (this includes all the latest data), although 15:00:00 may be preferable. Outside of "today", any clock other than 00:00:00 is not allowed (see the sketch after this list)
  • Assuming admin permissions, so all data is "fair game"
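
A minimal Go sketch of the end-time rule above; validateEndTime is hypothetical, not the actual coderd validation:

package main

import (
	"fmt"
	"time"
)

// validateEndTime sketches the rule above: end_time must be 00:00:00,
// except on "today", where any whole hour up to the next hour boundary
// is accepted. (Hypothetical; not the actual coderd validation.)
func validateEndTime(endTime, now time.Time) error {
	h, m, s := endTime.Clock()
	if m != 0 || s != 0 {
		return fmt.Errorf("end_time must be on the hour")
	}
	if h == 0 {
		return nil // midnight is always OK (future-day checks omitted here)
	}
	now = now.In(endTime.Location())
	y1, mo1, d1 := endTime.Date()
	y2, mo2, d2 := now.Date()
	if y1 == y2 && mo1 == mo2 && d1 == d2 && !endTime.After(now.Truncate(time.Hour).Add(time.Hour)) {
		return nil // "today" exception: e.g. 16:00:00 is OK when the clock is 15:33:00
	}
	return fmt.Errorf("end_time must be 00:00:00 unless it falls on today")
}

func main() {
	now := time.Date(2023, 7, 20, 15, 33, 0, 0, time.UTC)
	fmt.Println(validateEndTime(time.Date(2023, 7, 20, 16, 0, 0, 0, time.UTC), now)) // <nil>
	fmt.Println(validateEndTime(time.Date(2023, 7, 21, 16, 0, 0, 0, time.UTC), now)) // error
}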

Things we may want to improve in the future:

  • User latency: the data stored in workspace_agent_stats is suboptimal and includes agent<->coderd latency as well, not only user<->coderd latency
  • App usage calculations: right now we only know that a connection from e.g. VSCode is open, but not whether the user is actively using VSCode

@mafredri mafredri force-pushed the mafredri/feat-coderd-add-user-latency-and-insights-endpoints branch from 4ffc5f1 to 088620e Compare July 21, 2023 09:30
@mtojek mtojek self-requested a review July 21, 2023 13:09
@mtojek (Member) left a comment

Unless @bpmct or @BrunoQuaresma has some objections, I think that this is in good shape to get merged. You can iterate on improvements in follow-ups.

@johnstcn (Member) left a comment

Agreed, I think this is in good shape to start iterating. :shipit:

@mafredri mafredri enabled auto-merge (squash) July 21, 2023 14:07
@mafredri mafredri disabled auto-merge July 21, 2023 15:43
@mafredri mafredri force-pushed the mafredri/feat-coderd-add-user-latency-and-insights-endpoints branch 2 times, most recently from 4398e9f to 7fd13df Compare July 21, 2023 16:31
@mafredri mafredri enabled auto-merge (squash) July 21, 2023 16:33
@mafredri mafredri merged commit 30fe153 into main Jul 21, 2023
@mafredri mafredri deleted the mafredri/feat-coderd-add-user-latency-and-insights-endpoints branch July 21, 2023 18:00
@github-actions github-actions bot locked and limited conversation to collaborators Jul 21, 2023

Successfully merging this pull request may close these issues.

Implement API endpoint for template/deployment insights