-
Notifications
You must be signed in to change notification settings - Fork 963
chore: add usage tracking package #19095
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
This stack of pull requests is managed by Graphite. Learn more about stacking. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have some non-blocking comments below but might need to take another pass.
const ( | ||
CoderLicenseJWTHeader = "Coder-License-JWT" | ||
|
||
tallymanURL = "https://tallyman-ingress.coder.com" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see an argument for making this a var
so it's configurable at build time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we do need that we can add it very easily in the future.
if !allFailed { | ||
// These are all going to have the same message, so don't log | ||
// them. We already logged the overall error above. | ||
p.log.Warn(ctx, "tallyman rejected usage event", slog.F("id", event.ID), slog.F("message", rejectedEvent.Message), slog.F("permanent", rejectedEvent.Permanent)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we instead collect the failed IDs and warn once? Depending on the number of events, this could spam.
} else { | ||
// It's not good if this path gets hit, but we'll handle it as if it | ||
// was a temporary rejection. | ||
p.log.Warn(ctx, "tallyman did not include a usage event in the response, considering it temporarily rejected", slog.F("id", event.ID)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above, I'd prefer if we didn't log in a tight loop.
err = p.db.UpdateUsageEventsPostPublish(ctx, dbUpdate) | ||
if err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: suggest
if err := p.b.UpdateUsageEventsPostPublish(ctx, dbUpdate); err != nil { ... }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not enforced by the formatter and I prefer it to be two lines, so I'm going to keep it like this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Obligatory reminder to check migration number before merging!
|
||
# Usage tracking code requires intimate knowledge of Tallyman and Metronome, as | ||
# well as guidance from revenue. | ||
coderd/usage/ @deansheather |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in general, I'd like to ensure we have 2 code owners for each thing, so that we have some ability to make progress if someone is out
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think codeowners are required to approve PRs for merge, they just get auto-requested for review. I just want to be requested on every PR so I have visibility into changes. If I'm out, anyone else can approve.
Most other codeowners in this file have only one person listed and function the same way.
// Note that the following event types should not be updated once they are | ||
// merged into the product. Please consult Dean before making any changes. | ||
type Event interface { | ||
usageEvent() // to prevent external types from implementing this interface |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the motivation for preventing external types from implementing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'd be very bad if somehow event types were added to an external package by mistake and started getting used, since Tallyman would run into issues using them. If we ever do want to add implementations of this interface outside of this package, we can do so after assessing the impacts and removing this dummy method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense --- tallyman imports this package and this package only to determine what events to care about. Can you add a comment?
// the count of all existing managed agents (count=N) | ||
// - A new managed agent is created (count=1) | ||
type DCManagedAgentsV1 struct { | ||
Count uint64 `json:"count"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like we'll want a lot more than just the count so that customers can understand their usage, e.g. by template, organization, user.
startErr := make(chan error) | ||
go func() { | ||
err := publisher.Start() | ||
testutil.RequireSend(ctx, t, startErr, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can't call Require
methods in goroutines, only the main test goroutine.
handler func(req usage.TallymanIngestRequestV1) any | ||
) | ||
ingestURL := fakeServer(t, tallymanHandler(t, licenseJWT, func(req usage.TallymanIngestRequestV1) any { | ||
callCount := atomic.AddInt64(&calls, 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In order to prevent test races, we need to ensure that the publish calls are completed before we attempt to read this value. If the test prevents such races, atomic
is superfluous because we will not concurrently access it. Making it atomic doesn't do anything to prevent racy tests and is confusing because it signals to people that there might be concurrent accesses, when we need to ensure there are none.
// publishOnce publishes up to tallymanPublishBatchSize usage events to | ||
// tallyman. It returns the number of successfully published events. | ||
func (p *tallymanPublisher) publishOnce(ctx context.Context, deploymentID uuid.UUID) (int, error) { | ||
licenseJwt, err := p.getBestLicenseJWT(ctx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't be doing this for every publish. Can we connect it up to the rest of the license code so we don't add more queries? Or at least use the same query strategy of querying every 10 minutes and listening for published changes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't be doing this for every publish.
This is very light so I don't see the issue with querying it every time. The reason why the entitlements check isn't performed on demand and is cached is that it's used on almost every single request to the API. In general, this is being done a few times every 17 minutes per replica. The licenses table is also very small.
Can we connect it up to the rest of the license code so we don't add more queries?
The license code doesn't actually return much information about licenses themselves, or perform any license selection that matches our requirements. Quite frankly it's in desperate need of a rewrite, but it's out of scope for this PR.
Not used in coderd yet, see stack.
Adds two new packages:
coderd/usage
: provides an interface for the "Collector" as well as a stub implementation for AGPLenterprise/coderd/usage
: provides an interface for the "Publisher" as well as a Tallyman implementationRelates to coder/internal#814