feat: Move from datadog to generic otel #1567
Conversation
@kylecarbs looking for initial thoughts
Just a few q's. How does this play into database tracing stuff with DataDog?
telemetry/exporter.go
Outdated
tracerProvider := sdktrace.NewTracerProvider(
	sdktrace.WithBatcher(exporter),
	sdktrace.WithResource(res),
)
Where does this actually send data?
I can add a comment, but they get shipped to a local otel collector by default. You can change where they get shipped with otel env vars like OTEL_COLLECTOR_ENDPOINT and such, so we don't need to write any code to support the configuration. The datadog agent we have deployed on dev.coder.com has a collector built in, so this will automatically start sending the otel data there.
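For reference, a minimal sketch of what the exporter setup can look like with the OpenTelemetry Go OTLP gRPC exporter; the env var names (OTEL_EXPORTER_OTLP_*) are the SDK's standard conventions rather than anything specific to this PR:

```go
// Minimal sketch (not this PR's exact code): the OTLP exporter picks up its
// destination from standard OTel env vars such as OTEL_EXPORTER_OTLP_ENDPOINT,
// defaulting to a local collector on localhost:4317, so no coder-specific
// configuration code is needed.
package telemetry

import (
	"context"

	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func NewTracerProvider(ctx context.Context) (*sdktrace.TracerProvider, error) {
	// Endpoint, headers, TLS, etc. are read from OTEL_EXPORTER_OTLP_* env vars.
	exporter, err := otlptracegrpc.New(ctx)
	if err != nil {
		return nil, err
	}
	return sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
	), nil
}
```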
Ahh I see! Even just an external link would help. I was expecting --trace to accept an endpoint to write traces.
done
telemetry/httpmw.go
Outdated
@@ -0,0 +1,59 @@
package telemetry
I think the name telemetry might be confusing. I think tracing would be very reasonable. We might want to place this in coderd/tracing, because I'm sure we'll probably only trace there for now.
I was on the fence between the two
I put this higher level BECAUSE I didn't want to exclude the provisioner apps. Those reasonably could be traced as well, and those distributed traces would be connected in otel with the right glue code.
That being said, we are not currently using any tracing in the other apps, so I'm also fine to move to coderd if we want to isolate it for now.
I think we isolate for now. Nothing else has an HTTP server, and if that's our primary point of tracing I'd be hesitant to say we'll do it for other things soon.
done
telemetry/httpmw.go
Outdated
)

// HTTPMW adds tracing to http routes.
func HTTPMW(tracerProvider *sdktrace.TracerProvider, name string) func(http.Handler) http.Handler {
I like how you didn't wrap the *sdktrace.TracerProvider type. Very simple to process!
This doesn't do DB tracing yet; that never got merged with DD either. I've taken a look and it seems pretty simple though. This will match parity (for the important info) with datadog so it doesn't disrupt the monitoring.
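For illustration, a rough sketch of what that DB tracing could look like, assuming a wrapper around the store where each query opens a child span (tracedStore, GetUserByID, and the User type are made-up placeholders, not coder's real API):

```go
// Illustrative only: a store wrapper that opens a span per query so database
// calls join the same trace as the HTTP request. Types and method names here
// are hypothetical placeholders.
package tracing

import (
	"context"

	"go.opentelemetry.io/otel/codes"
	"go.opentelemetry.io/otel/trace"
)

type User struct{ ID string }

type store interface {
	GetUserByID(ctx context.Context, id string) (User, error)
}

type tracedStore struct {
	tracer  trace.Tracer
	wrapped store
}

func (s *tracedStore) GetUserByID(ctx context.Context, id string) (User, error) {
	// Child span inherits the request's trace from ctx.
	ctx, span := s.tracer.Start(ctx, "db.GetUserByID")
	defer span.End()

	user, err := s.wrapped.GetUserByID(ctx, id)
	if err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, err.Error())
	}
	return user, err
}
```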
Looks good! Had a few thoughts in addition to the comments:

- Do we currently inform the user where traces are sent (when --trace is enabled), should we?
- (Not this PR) I'm not seeing a panic handler, should we add one and make sure it plays nice with tracing? (a rough sketch of one option follows below)
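A minimal, hedged sketch of what such a panic handler could look like, assuming we record the panic on the request's active span (the Recover name and behavior here are illustrative, not part of this PR):

```go
// Illustrative sketch: recover from panics, attach the panic to the active
// span (if any), and return a 500 so tracing and error handling stay in sync.
package tracing

import (
	"fmt"
	"net/http"

	"go.opentelemetry.io/otel/codes"
	"go.opentelemetry.io/otel/trace"
)

func Recover(next http.Handler) http.Handler {
	return http.HandlerFunc(func(rw http.ResponseWriter, r *http.Request) {
		defer func() {
			if rec := recover(); rec != nil {
				err := fmt.Errorf("panic: %v", rec)
				// SpanFromContext returns a no-op span if tracing is disabled.
				span := trace.SpanFromContext(r.Context())
				span.RecordError(err)
				span.SetStatus(codes.Error, err.Error())
				http.Error(rw, "internal server error", http.StatusInternalServerError)
			}
		}()
		next.ServeHTTP(rw, r)
	})
}
```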
coderd/tracing/httpmw.go
Outdated
func HTTPMW(tracerProvider *sdktrace.TracerProvider, name string) func(http.Handler) http.Handler {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(rw http.ResponseWriter, r *http.Request) {
			// // do not trace if exporter has not be initialized
Suggested change:
- // // do not trace if exporter has not be initialized
+ // do not trace if exporter has not be initialized
coderd/tracing/httpmw.go
Outdated
}

// start span with default span name. Span name will be updated once request finishes
_, span := tracerProvider.Tracer(name).Start(r.Context(), "http.request")
Suggestion: Not ideal, but we could consider the following fallback:
- _, span := tracerProvider.Tracer(name).Start(r.Context(), "http.request")
+ _, span := tracerProvider.Tracer(name).Start(r.Context(), fmt.Sprintf("%s %s", r.Method, r.RequestURI))
My thought here is that if we want to combine logging with otel, either by using otel directly or by integrating via a logging library, having the name set early can provide additional context for logs that may be output before the handler completes.
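As an illustration of that idea, a hedged sketch that starts with the provisional name and renames the span once the route is known, assuming a chi router (the chi.RouteContext lookup and SetName call are illustrative, not necessarily what this PR ends up doing):

```go
// Illustrative sketch: start with a provisional "METHOD URI" span name so
// early logs have context, then rename to the routed pattern once the
// handler finishes. Assumes the router is chi; adjust for other routers.
package tracing

import (
	"fmt"
	"net/http"

	"github.com/go-chi/chi/v5"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func HTTPMW(tracerProvider *sdktrace.TracerProvider, name string) func(http.Handler) http.Handler {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(rw http.ResponseWriter, r *http.Request) {
			ctx, span := tracerProvider.Tracer(name).Start(r.Context(),
				fmt.Sprintf("%s %s", r.Method, r.RequestURI))
			defer span.End()

			next.ServeHTTP(rw, r.WithContext(ctx))

			// chi records the matched route pattern after the handler runs.
			if rctx := chi.RouteContext(ctx); rctx != nil {
				if pattern := rctx.RoutePattern(); pattern != "" {
					span.SetName(fmt.Sprintf("%s %s", r.Method, pattern))
				}
			}
		})
	}
}
```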
cli/server.go
Outdated
logger.Warn(cmd.Context(), "failed to start telemetry exporter", slog.Error(err))
} else {
	defer func() {
		_ = tracerProvider.Shutdown(cmd.Context())
We probably don't want to use cmd.Context() here since it is quite likely to already be cancelled, which would prevent Shutdown from waiting for flush. Perhaps we should create a new context with a timeout instead, say 5 seconds?
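A minimal sketch of that suggestion (the 5-second timeout is just the value floated above, and shutdownTracing is a made-up helper name):

```go
// Illustrative sketch: shut down the tracer provider with a fresh context so
// a cancelled cmd.Context() doesn't prevent the final flush of buffered spans.
package cli

import (
	"context"
	"time"

	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func shutdownTracing(tracerProvider *sdktrace.TracerProvider) {
	// 5 seconds is the value suggested above, not a hard requirement.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	_ = tracerProvider.Shutdown(ctx)
}
```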
Good catch, that makes sense
Yes, I want to have both of these things by community MVP.
This replaces the existing datadog tracing library with OpenTelemetry, which is a more open standard that more customers can integrate with.
I took a look at the middleware packages they use and they were pretty simple, so I've implemented the basic HTTP information in a tracing middleware.