Skip to content

chore: add usage tracking package #19095

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

deansheather
Copy link
Member

@deansheather deansheather commented Jul 30, 2025

Not used in coderd yet, see stack.

Adds two new packages:

  • coderd/usage: provides an interface for the "Collector" as well as a stub implementation for AGPL
  • enterprise/coderd/usage: provides an interface for the "Publisher" as well as a Tallyman implementation

Relates to coder/internal#814

Copy link
Member Author

deansheather commented Jul 30, 2025

@deansheather deansheather marked this pull request as ready for review July 30, 2025 06:05
Copy link
Member

@johnstcn johnstcn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some non-blocking comments below but might need to take another pass.

const (
CoderLicenseJWTHeader = "Coder-License-JWT"

tallymanURL = "https://tallyman-ingress.coder.com"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see an argument for making this a var so it's configurable at build time.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do need that we can add it very easily in the future.

if !allFailed {
// These are all going to have the same message, so don't log
// them. We already logged the overall error above.
p.log.Warn(ctx, "tallyman rejected usage event", slog.F("id", event.ID), slog.F("message", rejectedEvent.Message), slog.F("permanent", rejectedEvent.Permanent))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we instead collect the failed IDs and warn once? Depending on the number of events, this could spam.

} else {
// It's not good if this path gets hit, but we'll handle it as if it
// was a temporary rejection.
p.log.Warn(ctx, "tallyman did not include a usage event in the response, considering it temporarily rejected", slog.F("id", event.ID))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, I'd prefer if we didn't log in a tight loop.

Comment on lines +297 to +298
err = p.db.UpdateUsageEventsPostPublish(ctx, dbUpdate)
if err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: suggest

if err := p.b.UpdateUsageEventsPostPublish(ctx, dbUpdate); err != nil { ... }

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not enforced by the formatter and I prefer it to be two lines, so I'm going to keep it like this

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Obligatory reminder to check migration number before merging!


# Usage tracking code requires intimate knowledge of Tallyman and Metronome, as
# well as guidance from revenue.
coderd/usage/ @deansheather
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in general, I'd like to ensure we have 2 code owners for each thing, so that we have some ability to make progress if someone is out

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think codeowners are required to approve PRs for merge, they just get auto-requested for review. I just want to be requested on every PR so I have visibility into changes. If I'm out, anyone else can approve.

Most other codeowners in this file have only one person listed and function the same way.

// Note that the following event types should not be updated once they are
// merged into the product. Please consult Dean before making any changes.
type Event interface {
usageEvent() // to prevent external types from implementing this interface
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the motivation for preventing external types from implementing?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be very bad if somehow event types were added to an external package by mistake and started getting used, since Tallyman would run into issues using them. If we ever do want to add implementations of this interface outside of this package, we can do so after assessing the impacts and removing this dummy method

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense --- tallyman imports this package and this package only to determine what events to care about. Can you add a comment?

// the count of all existing managed agents (count=N)
// - A new managed agent is created (count=1)
type DCManagedAgentsV1 struct {
Count uint64 `json:"count"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like we'll want a lot more than just the count so that customers can understand their usage, e.g. by template, organization, user.

startErr := make(chan error)
go func() {
err := publisher.Start()
testutil.RequireSend(ctx, t, startErr, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't call Require methods in goroutines, only the main test goroutine.

handler func(req usage.TallymanIngestRequestV1) any
)
ingestURL := fakeServer(t, tallymanHandler(t, licenseJWT, func(req usage.TallymanIngestRequestV1) any {
callCount := atomic.AddInt64(&calls, 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to prevent test races, we need to ensure that the publish calls are completed before we attempt to read this value. If the test prevents such races, atomic is superfluous because we will not concurrently access it. Making it atomic doesn't do anything to prevent racy tests and is confusing because it signals to people that there might be concurrent accesses, when we need to ensure there are none.

// publishOnce publishes up to tallymanPublishBatchSize usage events to
// tallyman. It returns the number of successfully published events.
func (p *tallymanPublisher) publishOnce(ctx context.Context, deploymentID uuid.UUID) (int, error) {
licenseJwt, err := p.getBestLicenseJWT(ctx)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't be doing this for every publish. Can we connect it up to the rest of the license code so we don't add more queries? Or at least use the same query strategy of querying every 10 minutes and listening for published changes?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't be doing this for every publish.

This is very light so I don't see the issue with querying it every time. The reason why the entitlements check isn't performed on demand and is cached is that it's used on almost every single request to the API. In general, this is being done a few times every 17 minutes per replica. The licenses table is also very small.

Can we connect it up to the rest of the license code so we don't add more queries?

The license code doesn't actually return much information about licenses themselves, or perform any license selection that matches our requirements. Quite frankly it's in desperate need of a rewrite, but it's out of scope for this PR.

@github-actions github-actions bot added the stale This issue is like stale bread. label Aug 7, 2025
@github-actions github-actions bot closed this Aug 10, 2025
@deansheather deansheather reopened this Aug 12, 2025
@deansheather deansheather removed the stale This issue is like stale bread. label Aug 12, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants