Skip to content

feat: propagate x-ray trace id to event bridge targets #12481

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 24 commits into from
Apr 14, 2025

Conversation

maxhoheiser
Copy link
Member

@maxhoheiser maxhoheiser commented Apr 4, 2025

Motivation

Currently, when using X-Ray tracing with EventBridge, the trace context is not properly propagated to the target services. This makes it difficult to trace end-to-end requests across multiple AWS services in LocalStack. By propagating the X-Ray trace ID from the original request to EventBridge targets, we can provide a complete trace of the request flow, which is essential for debugging and monitoring distributed applications.
We can make use of the already correctly insturmented internal botocore client to propagate existing X-Ray trace ID headers.
For now only Lambda and API Gateway support X-Ray trace ID propagation.

X-Ray trace header concept

Changes

  • Set new X-Ray trace ID or new parent if present by creating a new Span
  • Modified the EventBridge target processing to include X-Ray trace ID when invoking target services with not already instrumented botocore client
  • Ensured compatibility with the existing trace header encoding mechanism

Testing

  • test_xray_trace_propagation_events_lambda
  • test_xray_trace_propagation_events_api_gateway
  • test_xray_trace_propagation_events_events

we need to skip a schedule rate test tests_schedule_rate_custom_input_target_sqs since the check of strict execution time of 60 seconds seams to be flaky

@maxhoheiser maxhoheiser added aws:events Amazon EventBridge semver: minor Non-breaking changes which can be included in minor releases, but not in patch releases labels Apr 4, 2025
@maxhoheiser maxhoheiser requested a review from joe4dev April 4, 2025 09:35
@maxhoheiser maxhoheiser self-assigned this Apr 4, 2025
@maxhoheiser maxhoheiser requested a review from bentsku as a code owner April 4, 2025 09:35
Copy link
Contributor

@bentsku bentsku left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great idea, this is going to help a lot!

I'm really sorry the API Gateway implementation was available "externally", this was implemented before @joe4dev implemented a really great way to access this data directly in the RequestContext with this PR: #11708.

For context, API Gateway logic happens outside of a regular service provider, which is why it needs custom logic, as the edge routes don't have access to the RequestContext.

But events can access the RequestContext.trace_context which already contains everything done here, and Lambda is a good example on how to use it.
You can access the X-Ray header via context.trace_context["aws_trace_header"]: I believe it should always be generated.

Additionally, do you think it's possible to add a test for this, maybe by re-using existing API Gateway tests? This feels a bit tricky as there are many steps along the way, events to API Gateway (to maybe lambda..., but not mandatory). But if we were to send an X-Ray header to the events.put_events call, some parts of it should be the same from the APIGW invocation for example.
By re-using the test test_xray_trace_propagation but adapt it to events, and using some parts of TestApiGatewayCommon.test_invocation_trace_id with a MOCK integration (and no lambda, I think it's not fully implemented yet in APIGW) we could test that we always receive a trace id, generated if not provided, and parts of it similar if provided?

Once this is addressed, I believe this is good to go 🚀

edit: oops, it seems this introduced an issue, lots of tests are failing 👀

@@ -1809,6 +1810,10 @@ def _process_entry(
return

region, account_id = extract_region_and_account_id(event_bus_name_or_arn, context)

trace_id = populate_trace_id(context.request.headers)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can replace that with the context trace, see the main review comment.

Suggested change
trace_id = populate_trace_id(context.request.headers)
trace_id = context.trace_context["aws_trace_header"]


trace_id = populate_trace_id(context.request.headers)

# TODO check interference with x-ray trace header
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: how could those clash? could you quickly explain to me, I'm not familiar with this 😅

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am piggybacking on the trace header for intermediary storing region and account for cross-region cross-account events

@maxhoheiser maxhoheiser force-pushed the feature/events/add-xray-trace-header-propagation branch 2 times, most recently from 1ce74d4 to 77190ae Compare April 4, 2025 12:33
@maxhoheiser maxhoheiser changed the title feat: propagate x-ray trace id feat: propagate x-ray trace id to event bridge targets Apr 7, 2025
Copy link

github-actions bot commented Apr 7, 2025

LocalStack Community integration with Pro

    2 files  ±    0      2 suites  ±0   1h 29m 20s ⏱️ - 23m 46s
3 174 tests  - 1 178  2 930 ✅  - 1 076  244 💤  - 102  0 ❌ ±0 
3 176 runs   - 1 178  2 930 ✅  - 1 076  246 💤  - 102  0 ❌ ±0 

Results for commit 0d4c4e0. ± Comparison against base commit 1724451.

This pull request removes 1183 and adds 5 tests. Note that renamed tests count towards both.
tests.aws.scenario.bookstore.test_bookstore.TestBookstoreApplication ‑ test_lambda_dynamodb
tests.aws.scenario.bookstore.test_bookstore.TestBookstoreApplication ‑ test_opensearch_crud
tests.aws.scenario.bookstore.test_bookstore.TestBookstoreApplication ‑ test_search_books
tests.aws.scenario.bookstore.test_bookstore.TestBookstoreApplication ‑ test_setup
tests.aws.scenario.kinesis_firehose.test_kinesis_firehose.TestKinesisFirehoseScenario ‑ test_kinesis_firehose_s3
tests.aws.scenario.lambda_destination.test_lambda_destination_scenario.TestLambdaDestinationScenario ‑ test_destination_sns
tests.aws.scenario.lambda_destination.test_lambda_destination_scenario.TestLambdaDestinationScenario ‑ test_infra
tests.aws.scenario.loan_broker.test_loan_broker.TestLoanBrokerScenario ‑ test_prefill_dynamodb_table
tests.aws.scenario.loan_broker.test_loan_broker.TestLoanBrokerScenario ‑ test_stepfunctions_input_recipient_list[step_function_input0-SUCCEEDED]
tests.aws.scenario.loan_broker.test_loan_broker.TestLoanBrokerScenario ‑ test_stepfunctions_input_recipient_list[step_function_input1-SUCCEEDED]
…
tests.aws.services.events.test_x_ray_trace_propagation ‑ test_xray_trace_propagation_events_api_gateway
tests.aws.services.events.test_x_ray_trace_propagation ‑ test_xray_trace_propagation_events_events[bus_combination0]
tests.aws.services.events.test_x_ray_trace_propagation ‑ test_xray_trace_propagation_events_events[bus_combination1]
tests.aws.services.events.test_x_ray_trace_propagation ‑ test_xray_trace_propagation_events_events[bus_combination2]
tests.aws.services.events.test_x_ray_trace_propagation ‑ test_xray_trace_propagation_events_lambda

♻️ This comment has been updated with latest results.

@maxhoheiser maxhoheiser force-pushed the feature/events/add-xray-trace-header-propagation branch from f8f11fa to 394dd8f Compare April 7, 2025 15:18
@maxhoheiser maxhoheiser requested a review from bentsku April 7, 2025 20:14
Copy link
Contributor

@bentsku bentsku left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, this is looking nice and it's great to see the tests are green! 🚀

I have some concerns about using the xray_recorder global variable here. Would this start falling if something else was using it? Is it the right solution and what issue is it solving?
Should the TraceHeader be extended instead to allow generating "child"/"segment" from the existing TraceHeader in the context?

Also have some concerns about registering/deregistering for every call of the botocore client, I'm not sure what can be done to not affect all the clients, but also be maybe more lightweight if needed, depending on the performance cost.

Sorry to delay this implementation, just want to make sure we're not introducing memory leaks/performance issues with this.

Thanks!



def create_segment_from_trace_header(trace_header: TraceHeader) -> None:
segment = xray_recorder.begin_segment(name="events.put_events")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: what is the goal of this? Wouldn't we start a "recording" with this? It also seems we're never closing the segment. Could this be an issue?

Could you explain what this is doing exactly? is this to not fully rebuild the trace header like APIGW is doing?

request.headers["X-Amzn-Trace-Id"] = trace_header_str

event_name = "before-send.lambda.*"
self.client.meta.events.register(event_name, add_xray_header)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you know what's the impact of registering and deregistering? I believe maybe @dfangl or @gregfurman had some issues with this at some point, doesn't it have quite a high performance cost?

Also, if the invoke() call fails, we won't deregister here, this should be wrapped in a try...finally probably?

I think generally, we might want to have proper instrumentation similar to the IAM instrumentation of the client, to have a way to add the X-Ray header to all connect_to client? We have the same issue in API Gateway to forward the X-Ray header to the Invoke call of the AWS_PROXY implementation. But this is out of scope, it's just that we're seeing this happen more and more often now 🤔

Until we have a more general approach, maybe we could take inspiration from @gregfurman implementation for the internal SQS header:

Not sure if that's the right way for our use-case here. Open to all suggestions! if @joe4dev also has ideas?


def get_trace_header_str_from_segment():
# Get the current segment
segment = xray_recorder.current_segment()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: what if we import the xray_recorder in the code somewhere else and start using it? wouldn't this start failing weirdly? as xray_recorder is a global variable?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice for the tests! Happy to see them validated now, thanks for doing it!

@maxhoheiser maxhoheiser force-pushed the feature/events/add-xray-trace-header-propagation branch 5 times, most recently from 27fbb44 to 6475be1 Compare April 11, 2025 09:59
@maxhoheiser maxhoheiser requested a review from bentsku April 11, 2025 11:05
Copy link
Contributor

@bentsku bentsku left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for improving the client logic.

I think we will need to find a global solution for the propagation of traces between services, as we're dealing it case per case for now.

This update also required a lot of function signature changes. Either using thread local context / context var, or passing down a "context" object holding the data you need in the interim, so you just need to update some kind of dictionary/dataclass and wouldn't need to update the function signature anymore. But still not optimal 😄

But as we need this right now, I think it's an acceptable way of going until we find a better way. Maybe we could have made it a bit more flexible by using **kwargs, but I think this should work!

We just need to update the signature of the upstream implementation of ECSTargetSender and we're good to go! thanks for pushing through this one! 🚀

Copy link
Contributor

@bentsku bentsku left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Just waiting to see upstream tests being green, had some issues with the run but seems to be going now 🚀

@maxhoheiser maxhoheiser force-pushed the feature/events/add-xray-trace-header-propagation branch from f2cfaae to 6bf8cc1 Compare April 14, 2025 09:40
@maxhoheiser maxhoheiser force-pushed the feature/events/add-xray-trace-header-propagation branch from ee63673 to 0d4c4e0 Compare April 14, 2025 12:06
@maxhoheiser maxhoheiser merged commit 5ad7f8e into master Apr 14, 2025
31 checks passed
@maxhoheiser maxhoheiser deleted the feature/events/add-xray-trace-header-propagation branch April 14, 2025 14:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
aws:events Amazon EventBridge semver: minor Non-breaking changes which can be included in minor releases, but not in patch releases
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants