-
-
Notifications
You must be signed in to change notification settings - Fork 4.2k
feat: propagate x-ray trace id to event bridge targets #12481
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: propagate x-ray trace id to event bridge targets #12481
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great idea, this is going to help a lot!
I'm really sorry the API Gateway implementation was available "externally", this was implemented before @joe4dev implemented a really great way to access this data directly in the RequestContext
with this PR: #11708.
For context, API Gateway logic happens outside of a regular service provider, which is why it needs custom logic, as the edge routes don't have access to the RequestContext
.
But events
can access the RequestContext.trace_context
which already contains everything done here, and Lambda is a good example on how to use it.
You can access the X-Ray header via context.trace_context["aws_trace_header"]
: I believe it should always be generated.
Additionally, do you think it's possible to add a test for this, maybe by re-using existing API Gateway tests? This feels a bit tricky as there are many steps along the way, events to API Gateway (to maybe lambda..., but not mandatory). But if we were to send an X-Ray header to the events.put_events
call, some parts of it should be the same from the APIGW invocation for example.
By re-using the test test_xray_trace_propagation
but adapt it to events
, and using some parts of TestApiGatewayCommon.test_invocation_trace_id
with a MOCK integration (and no lambda, I think it's not fully implemented yet in APIGW) we could test that we always receive a trace id, generated if not provided, and parts of it similar if provided?
Once this is addressed, I believe this is good to go 🚀
edit: oops, it seems this introduced an issue, lots of tests are failing 👀
@@ -1809,6 +1810,10 @@ def _process_entry( | |||
return | |||
|
|||
region, account_id = extract_region_and_account_id(event_bus_name_or_arn, context) | |||
|
|||
trace_id = populate_trace_id(context.request.headers) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can replace that with the context trace, see the main review comment.
trace_id = populate_trace_id(context.request.headers) | |
trace_id = context.trace_context["aws_trace_header"] |
|
||
trace_id = populate_trace_id(context.request.headers) | ||
|
||
# TODO check interference with x-ray trace header |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: how could those clash? could you quickly explain to me, I'm not familiar with this 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am piggybacking on the trace header for intermediary storing region and account for cross-region cross-account events
1ce74d4
to
77190ae
Compare
LocalStack Community integration with Pro 2 files ± 0 2 suites ±0 1h 29m 20s ⏱️ - 23m 46s Results for commit 0d4c4e0. ± Comparison against base commit 1724451. This pull request removes 1183 and adds 5 tests. Note that renamed tests count towards both.
♻️ This comment has been updated with latest results. |
f8f11fa
to
394dd8f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, this is looking nice and it's great to see the tests are green! 🚀
I have some concerns about using the xray_recorder
global variable here. Would this start falling if something else was using it? Is it the right solution and what issue is it solving?
Should the TraceHeader
be extended instead to allow generating "child"/"segment" from the existing TraceHeader
in the context?
Also have some concerns about registering/deregistering for every call of the botocore client, I'm not sure what can be done to not affect all the clients, but also be maybe more lightweight if needed, depending on the performance cost.
Sorry to delay this implementation, just want to make sure we're not introducing memory leaks/performance issues with this.
Thanks!
|
||
|
||
def create_segment_from_trace_header(trace_header: TraceHeader) -> None: | ||
segment = xray_recorder.begin_segment(name="events.put_events") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: what is the goal of this? Wouldn't we start a "recording" with this? It also seems we're never closing the segment. Could this be an issue?
Could you explain what this is doing exactly? is this to not fully rebuild the trace header like APIGW is doing?
request.headers["X-Amzn-Trace-Id"] = trace_header_str | ||
|
||
event_name = "before-send.lambda.*" | ||
self.client.meta.events.register(event_name, add_xray_header) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you know what's the impact of registering and deregistering? I believe maybe @dfangl or @gregfurman had some issues with this at some point, doesn't it have quite a high performance cost?
Also, if the invoke()
call fails, we won't deregister here, this should be wrapped in a try...finally
probably?
I think generally, we might want to have proper instrumentation similar to the IAM instrumentation of the client, to have a way to add the X-Ray header to all connect_to
client? We have the same issue in API Gateway to forward the X-Ray header to the Invoke
call of the AWS_PROXY
implementation. But this is out of scope, it's just that we're seeing this happen more and more often now 🤔
Until we have a more general approach, maybe we could take inspiration from @gregfurman implementation for the internal SQS header:
localstack/localstack-core/localstack/services/lambda_/event_source_mapping/pollers/sqs_poller.py
Line 73 in 97bc2c7
def _register_client_hooks(self): |
Not sure if that's the right way for our use-case here. Open to all suggestions! if @joe4dev also has ideas?
|
||
def get_trace_header_str_from_segment(): | ||
# Get the current segment | ||
segment = xray_recorder.current_segment() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: what if we import the xray_recorder
in the code somewhere else and start using it? wouldn't this start failing weirdly? as xray_recorder
is a global variable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice for the tests! Happy to see them validated now, thanks for doing it!
27fbb44
to
6475be1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks for improving the client logic.
I think we will need to find a global solution for the propagation of traces between services, as we're dealing it case per case for now.
This update also required a lot of function signature changes. Either using thread local context / context var, or passing down a "context" object holding the data you need in the interim, so you just need to update some kind of dictionary/dataclass and wouldn't need to update the function signature anymore. But still not optimal 😄
But as we need this right now, I think it's an acceptable way of going until we find a better way. Maybe we could have made it a bit more flexible by using **kwargs
, but I think this should work!
We just need to update the signature of the upstream implementation of ECSTargetSender
and we're good to go! thanks for pushing through this one! 🚀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Just waiting to see upstream tests being green, had some issues with the run but seems to be going now 🚀
Co-authored-by: Ben Simon Hartung <42031100+bentsku@users.noreply.github.com>
f2cfaae
to
6bf8cc1
Compare
ee63673
to
0d4c4e0
Compare
Motivation
Currently, when using X-Ray tracing with EventBridge, the trace context is not properly propagated to the target services. This makes it difficult to trace end-to-end requests across multiple AWS services in LocalStack. By propagating the X-Ray trace ID from the original request to EventBridge targets, we can provide a complete trace of the request flow, which is essential for debugging and monitoring distributed applications.
We can make use of the already correctly insturmented internal botocore client to propagate existing X-Ray trace ID headers.
For now only Lambda and API Gateway support X-Ray trace ID propagation.
X-Ray trace header concept
Changes
Testing
test_xray_trace_propagation_events_lambda
test_xray_trace_propagation_events_api_gateway
test_xray_trace_propagation_events_events
we need to skip a schedule rate test
tests_schedule_rate_custom_input_target_sqs
since the check of strict execution time of 60 seconds seams to be flaky