Skip to content

chore: add DRPC server implementation for network telemetry #13675

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Jul 1, 2024

Conversation

deansheather
Copy link
Member

@deansheather deansheather commented Jun 26, 2024

  • Adds implementation for tailnetproto.DRPCTailnetServer PostTelemetry method
    • Telemetry gets batched into groups of 1000 events, or smaller batches every 60s.
  • Adds new types to telemetry package and converter functions

TODO:

  • Write tests
  • Sending a max-sized batch of 1000 should reset the batch timer
    • Maybe we could just discard events received past 1000 to avoid this problem entirely.

Copy link
Member

ethanndickson commented Jun 27, 2024

This stack of pull requests is managed by Graphite. Learn more about stacking.

Join @deansheather and the rest of your teammates on Graphite Graphite

@deansheather
Copy link
Member Author

Graphite messed up all the commits in my branch and now they're co-authored by Ethan... the actual commits were only written by me. When I do the merge I'll remove the co-authored by line.

@deansheather deansheather marked this pull request as ready for review June 28, 2024 07:17
@ethanndickson ethanndickson force-pushed the dean/drpc-network-telemetry-server branch from 937bf83 to 479df1f Compare June 28, 2024 07:23
Comment on lines 268 to 272
func (b *NetworkTelemetryBatcher) Close() error {
close(b.closed)
<-b.done
return nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably accept a context so it can't block forever, or move the waiting to a separate WaitUntilEmpty() or similar

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not going to make it accept a context, but I've added a 5s timeout and an error.

if b.ticker != nil {
b.ticker.Reset(b.frequency)
}
go b.batchFn(events)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels a bit racey. We should probably have a separate channel to signal to the other goroutine that it should tick.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't racey because we hold a lock. If a tick happens while we're handling messages, the pending events list will be emptied and the periodic batcher goroutine will do nothing because there are no events. If somehow we went above the batch size anyways and overflow, it'll just send two telemetry events in short succession which is acceptable. I've added a comment that the ticker reset is best effort

Copy link
Member

@ethanndickson ethanndickson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@deansheather deansheather merged commit 6c94dd4 into main Jul 1, 2024
28 checks passed
@deansheather deansheather deleted the dean/drpc-network-telemetry-server branch July 1, 2024 15:50
@github-actions github-actions bot locked and limited conversation to collaborators Jul 1, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants