-
Notifications
You must be signed in to change notification settings - Fork 112
fix: don't wait for telemetry events MCP-179 #521
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR optimizes telemetry performance by making telemetry event emission non-blocking and asynchronous. The changes prevent the server from waiting for telemetry events to be sent before continuing with normal operations.
- Make telemetry event emission non-blocking by removing await keywords
- Add event emitters to track telemetry completion states
- Introduce logging for tool execution completion
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
Show a summary per file
File | Description |
---|---|
tests/unit/telemetry.test.ts | Adds helper function to properly test async telemetry events |
src/transports/streamableHttp.ts | Improves logger usage and error message clarity |
src/tools/tool.ts | Makes tool telemetry non-blocking and adds execution logging |
src/telemetry/telemetry.ts | Converts telemetry emission to non-blocking with event emitters |
src/server.ts | Removes blocking telemetry call |
src/common/connectionManager.ts | Adds explicit type annotation for events property |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
src/telemetry/telemetry.ts
Outdated
@@ -71,34 +90,22 @@ export class Telemetry { | |||
|
|||
public async close(): Promise<void> { | |||
this.isBufferingEvents = false; | |||
await this.emitEvents(this.eventCache.getEvents()); | |||
await timeout(this.emit([]), 5_000); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The timeout
function is called with an empty array []
instead of the cached events. This should be await timeout(this.emit(this.eventCache.getEvents()), 5_000);
to ensure cached events are sent during cleanup.
await timeout(this.emit([]), 5_000); | |
await timeout(this.emit(this.eventCache.getEvents()), 5_000); |
Copilot uses AI. Check for mistakes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is wrong. We're already emitting the cached events in emit
, so passing this.eventCache.getEvents()
would have made it so they're emitted twice.
src/telemetry/telemetry.ts
Outdated
async function timeout(promise: Promise<unknown>, ms: number): Promise<void> { | ||
await Promise.race([new Promise((resolve) => setTimeout(resolve, ms)), promise]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The timeout
function doesn't actually enforce a timeout - it just waits for whichever promise resolves first. Consider using Promise.race
with a rejection timeout to properly handle cases where the promise takes longer than the specified milliseconds.
async function timeout(promise: Promise<unknown>, ms: number): Promise<void> { | |
await Promise.race([new Promise((resolve) => setTimeout(resolve, ms)), promise]); | |
await Promise.race([ | |
promise, | |
new Promise((_, reject) => setTimeout(() => reject(new Error(`Operation timed out after ${ms} ms`)), ms)) | |
]); |
Copilot uses AI. Check for mistakes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This suggestion is actually good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really like it. The reason why I chose to resolve the promise is because I don't think close()
should throw if we fail to emit the events on time. I'll add a comment explaining that and remove the function as it's just a 1-liner anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem with this approach is that the timeout is only relevant for waiting, but the actual operation runs in the background (I don't see any usage of an AbortController here). At least the error would show in our logs even if we don't cancel the long-running op.
If you want to keep the resolution as is, I would at least print a log, so timeouts are not invisible. One of the reasons we are implementing this is because we think that telemetry is taking too much time at the end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you check that this timeout isn't going to lead to the process being held up?
adding .unref()
https://nodejs.org/api/timers.html#timeoutunref just to be safe would be good to.
Pull Request Test Coverage Report for Build 17497369294Warning: This coverage report may be inaccurate.This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
Details
💛 - Coveralls |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure when this regressed but I remember this exact setup with the track functions not being async being the case in the repo before.
Definitely how we should approach it when doing this on a more generalized level.
Proposed changes
While looking at some customer logs, I noticed we're likely being super inefficient by waiting for telemetry events to be sent before we can continue with normal server tasks.
Additionally, added a log for after tool execution.
Finally, it fixes the race condition reported in MCP-179.