Skip to content

X-Ray trace header parsing and Lambda tracing propagation #11708

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Oct 29, 2024

Conversation

joe4dev
Copy link
Member

@joe4dev joe4dev commented Oct 17, 2024

Motivation

We do not parse incoming X-Ray trace headers. Sample trace header (with Lambda lineage):

X-Amzn-Trace-Id: Root=1-5759e988-bd862e3fe1be46a994272793;Sampled=1;Lineage=a87bd80c:1|68fd508a:5|c512fbe3:2

Changes

Testing

Usage

Use context.trace_context within any provider to access the parsed X-Ray trace header. Check out the TraceHeader class for more details.

Create new TraceHeader:

TraceHeader()  # all fields are None !

h1 = TraceHeader().ensure_root_exists()
h1.to_header_str() == "Root=1-67126ec5-45ded1c6463bac5ef4be1d69"

h2 = TraceHeader(root="1-3152b799-8954dae64eda91bc9a23a7e8", parent="7fa8c0f79203be72", sampled=1)
h2.to_header_str() == "Root=1-3152b799-8954dae64eda91bc9a23a7e8;Parent=7fa8c0f79203be72;Sampled=1"

@joe4dev joe4dev added the semver: minor Non-breaking changes which can be included in minor releases, but not in patch releases label Oct 17, 2024
@joe4dev joe4dev added this to the 4.0 milestone Oct 17, 2024
@joe4dev joe4dev self-assigned this Oct 17, 2024
Copy link

github-actions bot commented Oct 17, 2024

S3 Image Test Results (AMD64 / ARM64)

  2 files    2 suites   3m 34s ⏱️
423 tests 369 ✅  54 💤 0 ❌
846 runs  738 ✅ 108 💤 0 ❌

Results for commit 006b34b.

♻️ This comment has been updated with latest results.

Copy link
Member Author

@joe4dev joe4dev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to hear your feedback about X-Ray trace propagation (presented in today's demo) :)

@@ -45,6 +45,7 @@ def __init__(self, service_manager: ServiceManager = None) -> None:
handlers.add_region_from_header,
handlers.rewrite_region,
handlers.add_account_id,
handlers.parse_trace_context,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bentsku @thrau What do you think about the ordering in the handler chain?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is good! At first, I thought it'd be nicer to have it before the edge routes, because we could also make use of this in API Gateway execute-api routes for example. However, we do not pass the RequestContext to the edge routes at the moment. (@thrau laments this design decision still today 😄) so there is no point in making it happen early for now. We can always move it later on if anything changes! But for now, this looks to me 👌

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Feel free to change later :)

@@ -57,6 +58,11 @@ class SQSInvocation:
exception_retries: int = 0

def encode(self) -> str:
# Encode TraceHeader
aws_trace_header_str = TraceHeader(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dfangl not super happy about this encoding/decoding hack. Feel free to share your suggestion.

This doesn't go well if aws_trace_header is not of type TraceHeader

@joe4dev joe4dev marked this pull request as draft October 18, 2024 07:47
Copy link

github-actions bot commented Oct 18, 2024

LocalStack Community integration with Pro

    2 files  ±0      2 suites  ±0   1h 42m 12s ⏱️ -12s
3 510 tests +1  3 097 ✅ +1  413 💤 ±0  0 ❌ ±0 
3 512 runs  +1  3 097 ✅ +1  415 💤 ±0  0 ❌ ±0 

Results for commit 006b34b. ± Comparison against base commit 4a775c4.

♻️ This comment has been updated with latest results.

@joe4dev joe4dev force-pushed the feature/lambda-xray-trace-propagation branch from d35c6d4 to 8fbd77c Compare October 18, 2024 14:27
@joe4dev joe4dev marked this pull request as ready for review October 18, 2024 17:06
@joe4dev joe4dev force-pushed the feature/lambda-xray-trace-propagation branch from d68d520 to 5a9693f Compare October 21, 2024 08:54
@joe4dev joe4dev force-pushed the feature/lambda-xray-trace-propagation branch from 0a44ee9 to 006b34b Compare October 22, 2024 08:55
@joe4dev joe4dev requested a review from bentsku October 25, 2024 14:53
Copy link
Contributor

@bentsku bentsku left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, this looks great! 🚀 I've only reviewed the parts that are not related to lambda, as I'm lacking a bit of context and knowledge there.

The handler looks great, I only have a few nits in the code taken from the AWS Python X-Ray SDK, as I think we could simplify it a bit by using new available Python standard lib utils and f-strings. Great work! 🚀

@@ -45,6 +45,7 @@ def __init__(self, service_manager: ServiceManager = None) -> None:
handlers.add_region_from_header,
handlers.rewrite_region,
handlers.add_account_id,
handlers.parse_trace_context,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is good! At first, I thought it'd be nicer to have it before the edge routes, because we could also make use of this in API Gateway execute-api routes for example. However, we do not pass the RequestContext to the edge routes at the moment. (@thrau laments this design decision still today 😄) so there is no point in making it happen early for now. We can always move it later on if anything changes! But for now, this looks to me 👌

Generate a random 16-digit hex str.
This is used for generating segment/subsegment id.
"""
return binascii.b2a_hex(os.urandom(8)).decode("utf-8")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is taken from AWS code, but I think we could use some Python utils to make it a bit nicer.

from secrets import token_hex

Suggested change
return binascii.b2a_hex(os.urandom(8)).decode("utf-8")
return token_hex(8)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice suggestions ✨ This looks much cleaner 🤩

I fully agree on a technical level. For licensing and maintenance reasons, I opted to minimize changes to the original AWS code here (it is easier to match the authoritative AWS implementation regarding header parsing). Do you think it's worth deviating here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose not, we could also keep the code as it is, if it makes it easier to port new changes as they come (not sure how often this happens?).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also crying every time I see %s%s%s%s%s 😭 . Nevertheless, I tend to prefer keeping the original for full parity (it's easier to relate to the AWS repo). Yeah, I don't expect direct changes here, maybe indirect ones for other X-Ray parts from the same source repo.

Generate a random trace id.
"""
self.start_time = int(time.time())
self.__number = binascii.b2a_hex(os.urandom(12)).decode("utf-8")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: not sure about the double __ here, should we just use a single underscore?
And we can also use token_hex instead of binascii and os.random

Comment on lines +32 to +38
return "%s%s%s%s%s" % (
TraceId.VERSION,
TraceId.DELIMITER,
format(self.start_time, "x"),
TraceId.DELIMITER,
self.__number,
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably use f-strings here, so we could even avoid the call to format:

Suggested change
return "%s%s%s%s%s" % (
TraceId.VERSION,
TraceId.DELIMITER,
format(self.start_time, "x"),
TraceId.DELIMITER,
self.__number,
)
return f"{TraceId.VERSION}{TraceId.DELIMITER}{self.start_time:x}{TraceId.DELIMITER}{self._number}"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great 👌

Copy link
Member

@dominikschubert dominikschubert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, most points have already been discussed and the current state seems very well documented and thought through. There's a few things we still need to tackle in the future like the sampling strategies but I actually quite like that we leave that out for now and can delegate this in the future.

Great work on re-using existing components. It makes a lot of sense to stay close to the official tooling here to make sure our behavior isn't diverging from AWS.

In terms of adjusting the internalized code from the xray lib, I'd generally prefer to stay closer to the original code if possible but as long as it doesn't impede porting new changes too much, it's of course also ok to modernize it a bit.

)
# The Lambda RIE requires a full tracing header including Root, Parent, and Samples. Otherwise, tracing fails
# with the warning "Subsegment ## handler discarded due to Lambda worker still initializing"
aws_trace_header.ensure_sampled_exists()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so if I understand that right, for lambda it's always Sampled=1 while other services will by default get a Sampled=0? 🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, the sampled field is missing altogether (i.e., not present rather than sampled=0) for other services if no incoming tracing header is present with the sampled field.

That's not ideal, but this depends on the (service-dependent) sampling strategies and other services. We're lacking parity tests here, so idk what's the best behavior. Just leaving it open for now ...

# TODO: replace this random parent id with actual parent segment created within the Lambda provider using X-Ray
aws_trace_header.ensure_parent_exists()
# TODO: test and implement Active and PassThrough tracing and sampling decisions.
# TODO: implement Lambda lineage: https://docs.aws.amazon.com/lambda/latest/dg/invocation-recursion.html
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for leaving the TODOs here. Those seem like really interesting tasks for engineers getting started at LocalStack 😉

:param payload: Invocation payload
:return: The invocation result
"""
# NOTE: consider making the trace_context mandatory once we update all usages (should be easier after v4.0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree, wanted to point out the same here 🙌

Comment on lines +58 to +59
root_trace_id = "1-3152b799-8954dae64eda91bc9a23a7e8"
xray_trace_header = TraceHeader(root=root_trace_id, parent="7fa8c0f79203be72", sampled=1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this present any issues when testing repeatedly against AWS with the same trace id / parent?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nope, works fine.
AWS does not perform any magic validation trying to re-construct the implicit start timestamp encoded here :)

PS: I fixed the repeated AWS validation issues in other X-Ray tests in this ext PR (awaiting review, low prio): https://github.com/localstack/localstack-ext/pull/3596

@joe4dev joe4dev merged commit e675014 into master Oct 29, 2024
47 checks passed
@joe4dev joe4dev deleted the feature/lambda-xray-trace-propagation branch October 29, 2024 19:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
semver: minor Non-breaking changes which can be included in minor releases, but not in patch releases
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants