Roll up workspace and app stats into a separate table and use it for querying insights and deployment DAUs #12122

Closed
mafredri opened this issue Feb 13, 2024 · 14 comments
@mafredri
Member

mafredri commented Feb 13, 2024

Problem

Currently we store raw agent stats for 6 months and app stats indefinitely. On dev.coder.com, the agent stats table takes up 6.3 GB of storage even though it's a fairly small deployment. We've managed to optimize our queries fairly well, but the raw data is still large and slow to query. App stats are not as bad since they're only written on an as-needed basis, compressing time ranges into a single row where possible.

To make insights more performant, and to allow us to store them for extended periods of time (years), we need to store them in a format that can be queried efficiently and doesn't use excessive amounts of storage.

Limitations

Template parameter usage is also a part of insights, but is not covered in this proposal.

Requirements

  • Generate reports on usage at a certain granularity (e.g. daily, weekly, monthly)
  • Store usage data in a database at a granularity that allows for timezone adjustments (e.g. 15m, 30m, 1h)
    • 15m covers most timezones, but has 4x the storage requirements of 1h
  • Store app usage by template and user (user is also needed for unique counts across templates)
  • Compute total usage by user
  • Store latency for user by template

Use cases

  • Deployment DAUs
  • Template insights

Design

We will roll up the data into a format that is queryable and more efficient to store. The rollup will happen on an hourly or daily schedule. If we require live data, we can utilize the same rollup queries to compute it from the raw data. Once the data is rolled up, we are free to delete data from either workspace_agent_stats or workspace_app_stats; however, there is no evidence that we need to delete from workspace_app_stats at this time.

We will store the data in a table called template_insights. This table will have the following columns:

  • id (bigserial)
  • start_time (timestamptz)
  • end_time (timestamptz)
  • template_id (uuid)
  • user_id (uuid)
  • latency_ms (smallint)
  • usage_mins (smallint)
  • ssh_mins (smallint)
  • vscode_mins (smallint)
  • jetbrains_mins (smallint)
  • webterminal_mins (smallint)
  • app_usage_mins (jsonb)
    • object {name1: int, name2: int, ...}

Note that we're storing the minutes so that we can properly account for usage that is smaller than the configured rollup range (be it 15m or 1h). The usage_mins field stores the actual total usage within the timeframe (for example, if you use ssh for 10 mins, then disconnect and start using code-server for 5 mins, usage_mins will reflect this and show 15 mins).
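
As a rough sketch of the above in DDL form (nullability, defaults, and the unique constraint mentioned later in the discussion are illustrative, not final):

create table template_insights (
    id bigserial primary key,
    start_time timestamptz not null,
    end_time timestamptz not null,
    template_id uuid not null,
    user_id uuid not null,
    latency_ms smallint not null,
    usage_mins smallint not null,
    ssh_mins smallint not null,
    vscode_mins smallint not null,
    jetbrains_mins smallint not null,
    webterminal_mins smallint not null,
    app_usage_mins jsonb not null default '{}'::jsonb,
    -- one row per (bucket, template, user); the bucket is [start_time, end_time)
    unique (start_time, template_id, user_id)
);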

With this format, if we assume that a user uses 1 template at a time, and that we have 1000 users who each use coder 8 hours/day for a year, we would have 1000 * 8 * 365 = 2,920,000 rows at 1h granularity. If we store this at 30m granularity, we would have 2,920,000 * 2 = 5,840,000 rows. This is a manageable amount of data to store and query.

Data type storage sizes are:

  • bigserial: 8 bytes
  • timestamptz: 8 bytes
  • uuid: 16 bytes
  • smallint: 2 bytes
  • jsonb: ~35 bytes/app (depending on length of name)
select pg_column_size('{"code-server": 60}'::jsonb);
=> 36

There are ways to optimize app_usage storage requirements further, but this is a good starting point. Furthermore, if we don't need to query the JSON, or don't mind the compute overhead of live conversion, we can simply store it as json, greatly reducing the storage requirement.

select pg_column_size('{"code-server":60}'::json);
=> 22

With this design, storage size will approximate: 8 + 8 + 8 + 16 + 16 + 2 + 2 + 2 + 2 + 2 + 2 + 36 = 104 bytes/row (the id, two timestamps, two UUIDs, six smallints, and the jsonb value)

If we store 5,840,000 rows, we would need 5,840,000 * 104 = ~0.6 GB of storage. This seems acceptable for one year of data for 1000 users. If a customer feels this is too much, we can make the retention configurable.

Note: These calculations don't take into account the storage space used by whatever indexes we end up creating.

Related: https://github.com/coder/customers/issues/384, #9301

@spikecurtis
Contributor

I don't think the bigserial id is necessary. The combination of (start, end, template ID, user ID) is a unique primary key. We will only ever query or delete based on one or more of these.
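
Sketched against the proposed table, that would look roughly like this (illustrative only):

-- drop the surrogate id and promote the identifying columns to the primary key
alter table template_insights
    drop column id,
    add primary key (start_time, end_time, template_id, user_id);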

@mtojek
Member

mtojek commented Feb 22, 2024

Thank you for describing the details.

ssh_mins (smallint)
vscode_mins (smallint)
jetbrains_mins (smallint)
webterminal_mins (smallint)

Should we keep separate columns for these *_mins fields, or is it cleaner to have a generic mins + type field? My main concern with what you proposed is the necessity of modifying the schema to add a column if we decide to support the next native application.
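
Purely to illustrate the alternative shape (the table and column names here are made up for the example):

-- one row per usage type instead of one column per type
create table template_insights_usage_by_type (
    start_time timestamptz not null,
    template_id uuid not null,
    user_id uuid not null,
    usage_type text not null, -- e.g. 'ssh', 'vscode', 'jetbrains', 'webterminal'
    usage_mins smallint not null,
    primary key (start_time, template_id, user_id, usage_type)
);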

If a customer feels this is too much, we can make the retention configurable.

I would consider it to be an MVP requirement. Different customers may have varying requirements and preferences regarding data retention.

start_time (timestamptz)
end_time (timestamptz)
template_id (uuid)
user_id (uuid)

Feature idea: would it be possible to build detailed monitoring based on this? Query all users that played with the application "Foobar" on Thursday afternoon.

Note: These calculations don't take into account the storage space used by whatever indexes we end up creating

Yeah, you may want to flesh out the indexing strategy a bit more

@spikecurtis
Contributor

I vote for 30m increments. This allows our days to exactly align with India's timezone. The only timezone at a quarter-hour offset from UTC with a significant population is Nepal (there are two others, one in Australia and one in New Zealand, with fewer than 1000 inhabitants each). DAUs that roll over at 11:45pm instead of midnight are probably not a big deal compared with being able to double the lookback time for the same DB space requirements.
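
For illustration, with 30m rows a day boundary in e.g. India (UTC+5:30) lands exactly on a bucket boundary, so a timezone-adjusted DAU query only needs to shift and truncate (a sketch against the proposed table):

select (start_time at time zone 'Asia/Kolkata')::date as day,
       count(distinct user_id) as daus
from template_insights
group by 1
order by 1;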

@spikecurtis
Contributor

We will roll up the data into a format that is queryable and more efficient to store. The rollup will happen on an hourly or daily schedule.

We should presumably roll up the data on the same periodicity as the rows. That is, as soon as we are past the end_time of the next row we could write, we start the rollup.

How will we deconflict between multiple Coderd instances trying to do the rollup? E.g. by using an advisory lock?

@mafredri
Member Author

@spikecurtis can you also provide a motivation? I don't think a lack of necessity is a good enough reason to diverge from a common database pattern. I wasn't going for ultimate compression here, but by the same token, if this was sparked by the 8-byte saving, then we could just as well change the end time to a smallint to save another 6 bytes. I decided against that so the data is more readable, and it may help with query simplicity as well.

(I also find IDs to be a good safety net in the event that something goes wrong.)

@spikecurtis
Contributor

@spikecurtis can you also provide a motivation? I don't think a lack of necessity is a good enough reason to diverge from a common database pattern. I wasn't going for ultimate compression here, but by the same token, if this was sparked by the 8-byte saving, then we could just as well change the end time to a smallint to save another 6 bytes. I decided against that so the data is more readable, and it may help with query simplicity as well.

The reason we have id columns in other parts of the database is that the other identifying columns might be mutable (i.e. a user could change their name). Even if we, say, don't allow people to change names, we might lift that requirement in future, and reworking the primary key is a big pain. Having an id column is common because these types of rows are common. It's not a pattern that we should just follow because it's a pattern.

In the rollups case, it simply does not make sense for (start, end, template ID, user ID) to change, since that would change the meaning of the row (template ID and user ID by definition won't be modified). Thus, we have a set that is unique, identifying, and won't be modified: exactly what you want for a primary key.

Adding an autoincrementing id column doesn't serve any purpose, and in fact, muddies the waters because it makes any insert have a unique primary key even if that same logical piece of data already exists.

(I also find IDs to be a good safety net in the event that something goes wrong.)

I'll flip this one back to you and ask: how do you think it will help in the event that something goes wrong?

@mafredri
Member Author

@mtojek

Should we keep separate columns for these *_mins fields, or is it cleaner to have a generic mins + type field? My main concern with what you proposed is the necessity of modifying the schema to add a column if we decide to support the next native application.

It's a valid concern for huge tables since adding a column requires rewriting the table (doubling the storage space requirement temporarily). However, if we set the default retention to a conservative number (6m-1y?) and document that a large number there may eventually affect migration speed, that could be an OK approach.

When splitting these out into a separate table, I'd be slightly worried about the increased storage space requirement and the effect on query speed.

Initially I considered keeping all of these "mins" in the jsonb field, but that would've increased the storage requirement quite a bit.

If a customer feels this is too much, we can make the retention configurable.

I would consider it to be an MVP requirement. Different customers may have varying requirements and preferences regarding data retention.

With MVP, do you mean it should be part of the initial version? I'd think we have time to add it after-the-fact. It's basically just a periodic query to delete data.
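
For example, something along these lines (the retention period is just a placeholder):

delete from template_insights
where end_time < now() - interval '1 year';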

Feature idea: would it be possible to build detailed monitoring based on this? Query all users that played with the application "Foobar" on Thursday afternoon.

That should be doable.

Note: These calculations don't take into account the storage space used by whatever indexes we end up creating

Yeah, you may want to flesh out the indexing strategy a bit more

I could guess, but at the end of the day we'll create the indexes to target performance. My best guesstimate is that they would add between 1.5x and 2.5x to the storage requirement, but I can spend some time prototyping this if we want harder numbers.
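
To give a rough idea of the kind of indexes I would expect (a guess, not a commitment):

-- filter template insights by template and time range
create index on template_insights (template_id, start_time);
-- per-user totals and unique counts across templates
create index on template_insights (user_id, start_time);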


@spikecurtis

We should presumably roll up the data on the same periodicity as the rows. That is, as soon as we are past the end_time of the next row we could write, we start the rollup.

I think it's better to delay it a bit to ensure all the data we're interested in is present/written. We could even just do it once a day and join it with the live data when showing "today".

How will we deconflict between multiple Coderd instances trying to do the rollup? E.g. by using an advisory lock?

Yeah, an advisory lock is what I was thinking.
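
Roughly along these lines (a sketch; the lock key is arbitrary and the rollup statement itself is elided):

begin;
-- returns true for exactly one coderd; the others see false and skip this rollup cycle
select pg_try_advisory_xact_lock(1236);
-- ... run the rollup insert here if the lock was acquired ...
commit;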

@mafredri
Member Author

The reason we have id columns in other parts of the database is that the other identifying columns might be mutable (i.e. a user could change their name). Even if we, say, don't allow people to change names, we might lift that requirement in future, and reworking the primary key is a big pain. Having an id column is common because these types of rows are common. It's not a pattern that we should just follow because it's a pattern.

In the rollups case, it simply does not make sense for (start, end, template ID, user ID) to change, since that would change the meaning of the row (template ID and user ID by definition won't be modified). Thus, we have a set that is unique, identifying, and won't be modified: exactly what you want for a primary key.

Adding an autoincrementing id column doesn't serve any purpose, and in fact, muddies the waters because it makes any insert have a unique primary key even if that same logical piece of data already exists.

I get what you're saying, and I agree that each row can be uniquely identified (by even fewer columns: (start, template, user)). This unique constraint is actually part of the design (even if omitted in the issue description).

As I understand it, your reasoning is still mainly that it's not used so we should remove it(?). My POV, on the other hand, is that of future extensibility: if we need to reference back to this table, I'd rather have the ID in place than need to add it after the fact. I also think the argument for "muddied waters" is somewhat moot given the extensibility and the unique constraint. It's of course possible to create foreign keys against multiple columns, but then all referenced columns must be duplicated in the other table.

I'll flip this one back to you and ask: how do you think it will help in the event that something goes wrong?

I'd like to say "you'll know it when you need it", but one contrived example would be deploying a version of Coder that writes bad data to the table: after updating to a fix, it would be possible to clean up the bad data between IDs X and Y just by figuring out the starting point and the end point. For argument's sake, we'll say the data is good at masking itself as legit data too, haha.
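
That is, the cleanup could be as simple as (with X and Y taken from the incident timeline):

delete from template_insights
where id between $1 and $2; -- $1 = first bad id (X), $2 = last bad id (Y)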

It can also tell you the ~order of inserts, in case you spot something amiss.

@mtojek
Member

mtojek commented Feb 22, 2024

However, if we set the default retention to a conservative number (6m-1y?) and document that a large number there may eventually affect migration speed, that could be an OK approach.

That should work 👍 Thanks for the answer!

With MVP, do you mean it should be part of the initial version? I'd think we have time to add it after-the-fact. It's basically just a periodic query to delete data.

We don't know where we'll land with our userbase limits, so I wanted to improve on the hardcoded retention limit we have now. I'm OK with doing this post-MVP too; I just assumed that it might be super easy to change.

But I can spend some time prototyping this if we want harder numbers.

I suspect that we will need indices similar to what we have now, but wanted to confirm. This issue thread is a ~mini RFC, so listing them might be useful to present the full scope. I'm fine with introducing them afterward too; let's just document them somewhere.

@spikecurtis
Contributor

spikecurtis commented Feb 22, 2024

As I understand it, your reasoning is still mainly that it's not used so we should remove it(?). My POV, on the other hand, is that of future extensibility: if we need to reference back to this table, I'd rather have the ID in place than need to add it after the fact. I also think the argument for "muddied waters" is somewhat moot given the extensibility and the unique constraint. It's of course possible to create foreign keys against multiple columns, but then all referenced columns must be duplicated in the other table.

Yeah, it's not used so we shouldn't have it. I'm not sure what sounds unreasonable about that. It's not a big deal to add later if we find an actual use for it.

@mafredri
Member Author

Yeah, it's not used so we shouldn't have it. I'm not sure what sounds unreasonable about that. It's not a big deal to add later if we find an actual use for it.

As per the discussion above, we probably shouldn't make changes to this table lightly as it can be quite big (and rewriting takes time/space). But we can leave it out for now 👍🏻.

@mtojek
Member

mtojek commented Mar 26, 2024

Hey @mafredri! It would be awesome to share any updates on what you have accomplished and what is still missing. I don't see a checklist that we could use to ensure transparency regarding what is happening right now.

@mafredri
Member Author

mafredri commented Mar 26, 2024

@mtojek absolutely!

With the following three PRs, we are now producing rolled-up template usage data (i.e. insights) in the template_usage_stats table:

Every 5 minutes, we refresh the latest 1-2 hours of data so that we take into consideration any data arriving late, but after that the stats are not touched. This also means that we have a 0-5 minute delay on "real-time" data for insights.

With the following query rewrites, we are using the rolled up data (template_usage_stats):

Our Prometheus exporter also produces insights; however, these can't use the rolled-up data because they use a 5 minute window. As such, the queries were simply updated to match any changes introduced in the above feature/refactor work:

There are two tasks remaining:

As a final piece of the puzzle, we will delete old workspace agent stats. Given that the data is rolled up, we will only keep 1 day of data. (It's the largest table in the database and this will release 7 GB of storage in our dogfood environment; for reference, the next largest table is 472 MB.)
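
In other words, roughly (a sketch; the exact column and cutoff may differ):

delete from workspace_agent_stats
where created_at < now() - interval '1 day';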

PS. The new template_usage_stats is 10 MB at the time of writing, after rolling up all the data (in dogfood).

@mtojek
Member

mtojek commented Apr 2, 2024

Implementation done 👍 Credits to @mafredri.

We'd like to keep #12674 open for a while, but don't need to keep the meta issue anymore.
