Skip to content

fix: make GetWorkspacesEligibleForTransition return even less false positives #15594

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 36 commits into from
Dec 2, 2024

Conversation

DanielleMaywood
Copy link
Contributor

@DanielleMaywood DanielleMaywood commented Nov 19, 2024

Relates to #15082

Further to #15429, this reduces the amount of false-positives returned by the 'is eligible for autostart' part of the query. We achieve this by calculating the 'next start at' time of the workspace, storing it in the database, and using it in our GetWorkspacesEligibleForTransition query.

The prior implementation of the 'is eligible for autostart' query would return all workspaces that at some point in the future might be eligible for autostart. This now ensures we only return workspaces that should be eligible for autostart.

We also now pass currentTick instead of t to the GetWorkspacesEligibleForTransition query as otherwise we'll have one round of workspaces that are skipped by isEligibleForTransition due to currentTick being a truncated version of t.

@DanielleMaywood DanielleMaywood marked this pull request as ready for review November 20, 2024 15:47
@DanielleMaywood DanielleMaywood changed the title fix: make GetWorkspacesEligibleForTransition return no false positives fix: make GetWorkspacesEligibleForTransition return even less false positives Nov 21, 2024
@dannykopping dannykopping removed their request for review November 21, 2024 10:54
@dannykopping
Copy link
Contributor

Removed myself as reviewer; Cian & Mathias have much more context here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Obligatory reminder to double-check migration number before merge 👍

Copy link
Member

@mafredri mafredri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this approach looks fine on the surface level, but I'm a bit worried that I'm not seeing any invalidation for "next start at". I.e. changes that mean we should disregard or recompute it.

For instance, it could be set in the future, but a change is done to start it sooner. Will it propagate?

Just a few thoughts that pop into my mind:

  • What happens if a template scheduling options are changed?
  • What happens if workspace settings are updated (e.g. autostart schedule)?
  • What happens if a workspace is updated to a new template version
  • What happens if a user is disabled after the nextStartAt has been set (this might be OK, just mentioning for completeness)

@DanielleMaywood
Copy link
Contributor Author

DanielleMaywood commented Nov 21, 2024

What happens if a template scheduling options are changed?
What happens if a workspace is updated to a new template version

It appears I completely forgot to consider the template being updated, my bad, will resolve this! Thanks for spotting 👍

What happens if workspace settings are updated (e.g. autostart schedule)?

I've handled this (as far as I can tell), the /workspaces/{workspace}/autostart [put] endpoint updates the next_start_at column.

What happens if a user is disabled after the nextStartAt has been set (this might be OK, just mentioning for completeness)

The query has users.status = 'active'::user_status in it. So it will not return that is eligible for autostart.

Copy link
Member

@mafredri mafredri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! Seeing it, I'm not sure how I feel about the db trigger approach, but I like the enforcement and think it's worth giving it a shot. I think we still have some possibilities for data inconsistency without a trigger on the templates table, though.

// We do not know if workspace with a zero next start is eligible
// for autostart, so we accept this false-positive. This can occur
// when a coder version is upgraded and next_start_at has yet to
// be set.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a new known case, reset by db trigger. Should we implement the trigger in memory db as well? (I know it's probably going away soon, but still, might cause flakes?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it'd cause flakes, but instead cause dbmem tests to fail when postgres tests do not. Happy to add it though

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thought, I've checked dbmem.go for how we emulate triggers and I'm not sure there is a good way to do it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no good way to do it, just ad hoc code in various functions 😢

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah that's unfortunate.

I think attempting to emulate this trigger isn't worth it then. If someone writes a test that is dependent on the trigger (which they ideally shouldn't) then they'll have to disable it when running under dbmem.go.

CREATE TRIGGER trigger_update_workspaces_schedule
AFTER UPDATE ON workspaces
FOR EACH ROW
EXECUTE PROCEDURE nullify_workspace_next_start_at();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking we also need a trigger for changes to a template time-til-dormant? Or do I misunderstand that interaction?

My thought process: if the time-til-dormant is reduced, we might've previously calculated a next start for the workspace which might no longer be valid because dormancy was sped up.

It's probably a fringe edge-case, but might be hard to debug customer reports if they run into it.

I'm thinking this applies to template allowed days changing as well. Tomorrow might've been blocked but is now unblocked yet we computed the autostart for the day after.

I'm wondering if there are other such interactions between template and workspace we should cover as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I agree with you here, I'll try and think if there are any other similar interactions 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking we also need a trigger for changes to a template time-til-dormant? Or do I misunderstand that interaction?

My thought process: if the time-til-dormant is reduced, we might've previously calculated a next start for the workspace which might no longer be valid because dormancy was sped up.

After digging into this a little bit more, I don't think this is an issue.

I had forgotten to include dormant_at IS NOT NULL in autostart part of the GetWorkspacesEligibleForTransition SQL query, so I've added that now (it already existed in the coderd lifecycle executor side).

Instead of handling changes to time_til_dormancy, I've instead modified the existing trigger to check if the workspace is dormant. When a workspace is marked as dormant, the workspace's next_start_at is made NULL.

I'm thinking this applies to template allowed days changing as well. Tomorrow might've been blocked but is now unblocked yet we computed the autostart for the day after.

This edge case has been handled in the EnterpriseTemplateScheduleStore

Copy link
Member

@johnstcn johnstcn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving to unblock.

@DanielleMaywood DanielleMaywood merged commit e21a301 into main Dec 2, 2024
31 checks passed
@DanielleMaywood DanielleMaywood deleted the dm-experiment-autostart branch December 2, 2024 21:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants