Capturing Initialization and Timeout errors for AWS Lambda Integration #756
Conversation
1. Added a new wrapper decorator for the post_init_error method to capture initialization errors for the AWS Lambda integration.
2. Modified the _wrap_handler decorator to run a parallel thread that captures timeout errors.
3. Modified the _make_request_event_processor decorator to take the execution duration as a parameter.
4. Added a TimeoutThread class in utils.py, used to capture timeout errors.
lgtm so far, please add tests though
```python
# type: (*Any, **Any) -> Any

# Fetch Initialization error details from arguments
error = json.loads(args[1])
```
Can we move this down into line 47, no reason to do this work and potentially crash even if the integration is disabled
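A minimal sketch of that reordering, assuming the names from the integration module (`init_error`, `Hub`, `AwsLambdaIntegration`); the body is illustrative, not the final code:

```python
import json

from sentry_sdk import Hub
from sentry_sdk.integrations.aws_lambda import AwsLambdaIntegration


def _wrap_init_error(init_error):
    def sentry_init_error(*args, **kwargs):
        hub = Hub.current
        integration = hub.get_integration(AwsLambdaIntegration)
        if integration is None:
            # Bail out first: a disabled integration should never pay the
            # JSON-parsing cost or risk crashing on unexpected arguments.
            return init_error(*args, **kwargs)

        # Only now fetch the initialization error details from the arguments.
        error = json.loads(args[1])
        # ... capture `error` as an event here ...
        return init_error(*args, **kwargs)

    return sentry_init_error
```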
Sure, I'll make this change.
```python
# environment from arguments and, re-raising it to capture it as an event.
if error.get("errorType"):
    exc_info = sys.exc_info()
    reraise(*exc_info)
```
Why reraise to catch again instead of calling `sys.exc_info` once? Does this change the stacktrace somehow?
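A hedged sketch of capturing the active exception info once instead of re-raising it, using the SDK's `event_from_exception` helper; how this is wired into the wrapper is an assumption:

```python
import sys

from sentry_sdk.utils import event_from_exception


def _capture_init_error(hub):
    client = hub.client
    if client is None:
        return

    exc_info = sys.exc_info()
    if exc_info and all(exc_info):
        # The error is already being handled, so capture it directly
        # rather than re-raising just to catch it again.
        event, hint = event_from_exception(
            exc_info,
            client_options=client.options,
            mechanism={"type": "aws_lambda", "handled": False},
        )
        hub.capture_event(event, hint=hint)
```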
Nope, this doesn't change the stacktrace.
I got your point. I'll check once and capture the details without reraising.
```
@@ -126,6 +196,10 @@ def sentry_to_json(*args, **kwargs):

        lambda_bootstrap.to_json = sentry_to_json
    else:
        lambda_bootstrap.LambdaRuntimeClient.post_init_error = _wrap_init_error(
```
Note this will be executed for Python 3.7 only.
Yes, since for the Python 2.7 and 3.6 environments the runtime already handles these issues, I don't need to handle those scenarios. In those cases the runtime generates a FaultException for initialization errors.
```python
def test_timeout_error(run_lambda_function):
    # Modifying LAMBDA_PRELUDE since capturing timeout error is kept optional.
    modified_prelude = LAMBDA_PRELUDE.replace(
        "[AwsLambdaIntegration()]", "[AwsLambdaIntegration(True)]"
    )
```
This test could silently do the wrong thing if the string `"[AwsLambdaIntegration()]"` is not part of `LAMBDA_PRELUDE` (e.g. after an unrelated code change). Also, using a positional argument is cryptic: one needs to jump through hoops to figure out what the `True` value is used for.
My suggestion is to add a new keyword argument to `init_sdk` defined in `LAMBDA_PRELUDE`:

```python
def init_sdk(check_timeout_error=False, **extra_init_args):
    # ...
    integrations=[AwsLambdaIntegration(check_timeout_error=check_timeout_error)],
```

This way, instead of string replacement in the prelude, we can simply append code that uses the new argument:

```python
init_sdk(check_timeout_error=True)
```
```
@@ -73,6 +131,14 @@ def _drain_queue():

class AwsLambdaIntegration(Integration):
    identifier = "aws_lambda"

    def __init__(self, check_timeout_error=False):
```
`check_timeout_error` is a rather misleading name because it doesn't cause the integration to "check for a timeout error". What it does is trigger a warning if you get "close enough" to the timeout.
```
@@ -25,6 +27,45 @@

F = TypeVar("F", bound=Callable[..., Any])


# Constants
TIMEOUT_THRESHOLD_MILLIS = 1500  # Minimum time required to capture TimeoutError
```
This sort of makes `AwsLambdaIntegration(check_timeout_error=True)` "maybe" work as intended, depending on external state that is not obvious when you initialize the integration. Do we really need this?
```python
if integration.get_check_timeout_error():
    # Starting the Timeout thread only if the configured time is greater than Timeout threshold value
    if configured_time_in_millis > TIMEOUT_THRESHOLD_MILLIS:
```
If we do need this threshold, then let's write a single `if` condition, which means fewer indentation levels:

```python
if COND_A and COND_B:
    ...
```
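Concretely, using the names from the diff above (a sketch, with the `TimeoutThread` argument shown only for illustration):

```python
# One combined condition: the feature must be enabled and the Lambda's
# configured timeout must leave enough headroom for an early warning.
if (
    integration.check_timeout_error
    and configured_time_in_millis > TIMEOUT_THRESHOLD_MILLIS
):
    timeout_thread = TimeoutThread(configured_time_in_millis / 1000.0)
    timeout_thread.start()
```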
```python
configured_time_in_sec = configured_time_in_millis / SECONDS_CONVERSION_FACTOR
configured_time = int(configured_time_in_sec)

# Setting up the exact integer value of configured time(in seconds)
if configured_time < configured_time_in_sec:
    configured_time = configured_time + 1
```
`time.sleep` can work with floating point numbers, so this dance to round up to the next integer is not really needed, is it? We could also probably get rid of `SECONDS_CONVERSION_FACTOR` and simply use a subsecond timeout in the tests, perhaps?
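A sketch of what that simplification could look like, keeping the variable names from the diff above:

```python
import time

# time.sleep accepts floats, so there is no need to round the
# millisecond-based value up to the next whole second.
waiting_time_in_sec = configured_time_in_millis / 1000.0
time.sleep(waiting_time_in_sec)
```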
sentry_sdk/utils.py (Outdated)

```python
class TimeoutThread(threading.Thread):
    """Creates a Thread."""
```
This is not a useful docstring for this class.
sentry_sdk/utils.py (Outdated)

```python
def get_timeout_duration(self):
    # type: () -> float
    return self.timeout_duration

def get_configured_timeout(self):
    # type: () -> int
    return self.configured_timeout
```
In Python it is not idiomatic to have this type of one line getters. Much simpler to access the properties directly.
sentry_sdk/utils.py (Outdated)

```python
# Raising Exception after timeout duration is reached
raise Exception(
    "WARNING : Function is expected to get timed out. Configured timeout duration = {} seconds".format(
        self.get_configured_timeout()
    )
)
```
We could include in the message the function name (if available) and how long the function has run for.
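For instance, something along these lines; `function_name` and `elapsed_seconds` would have to be passed into the thread, which the current code does not do, so treat them as placeholders:

```python
# A sketch of a more informative message.
raise Exception(
    "WARNING : Function {} is expected to get timed out. It has been "
    "running for {:.1f} of its {} configured seconds.".format(
        function_name, elapsed_seconds, self.configured_timeout
    )
)
```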
```python
)
expected_text = "WARNING : Function is expected to get timed out. Configured timeout duration = 4 seconds"
if not events:
    # In case of Python 2.7 runtime environment
```
Tests should not rely on the truthy value of `events` and infer it is Python 2.7 or 3.x. E.g., if something is wrong in Python 3.x and `events` is the empty list, the test would go into the wrong branch of the `if...else`.
The runtime used is tricky to access in tests right now. I would change this code:

```python
@pytest.fixture(params=["python3.6", "python3.7", "python3.8", "python2.7"])
def run_lambda_function(request):
    # access to request.param somewhere here
```

to this:

```python
@pytest.fixture(params=["python3.6", "python3.7", "python3.8", "python2.7"])
def lambda_runtime_version(request):
    return request.param


def run_lambda_function(lambda_runtime_version):
    # access lambda_runtime_version instead
```

Then in the test you can depend on the `lambda_runtime_version` fixture yourself and inspect it.
Also I am confused, does this mean timeout events are not actually working on python 2.7?
```python
except Exception as e:
    # Exception caught in case of Initialization error
    pass
```
It doesn't look right to silence the exception here. If calling the subprocess fails for some arbitrary reason, the test should fail.
Please make this behavior configurable via a flag passed to `inner`. Rodolfo is right that this exception should be re-raised for most tests, except for the test that tests functions failing on initialization.

In fact I would do this:

```python
def inner(..., import_locally=True):
    ...
    if import_locally:
        subprocess.check_call()
```

The line literally only exists to validate the file before sending it to AWS. In your case you want to send a broken file to AWS intentionally.
This is a good poc but there's some work left to do here to clean it up. Mostly I am concerned that timeouts are not working or tested correctly on python 2.7.
sentry_sdk/utils.py (Outdated)

```python
# type: () -> None
time.sleep(self.get_timeout_duration())
# Raising Exception after timeout duration is reached
raise Exception(
```
Some points on this:
- I would prefer a dedicated exception type here, it makes filtering in the UI easier (see the sketch after this list)
- The stacktrace of this thread is less useful than of the main thread. I am not sure if accessing the main thread's stacktrace is possible.
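A minimal sketch of the dedicated exception type mentioned in the first point; later in this thread a `ServerlessTimeoutWarning` along these lines was adopted:

```python
class ServerlessTimeoutWarning(Exception):
    """Raised when a serverless function is close to hitting its execution timeout."""


# Inside TimeoutThread.run(), raise the dedicated type instead of a bare Exception:
#     raise ServerlessTimeoutWarning(
#         "WARNING : Function is expected to get timed out. "
#         "Configured timeout duration = {} seconds.".format(self.configured_timeout)
#     )
```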
```python
try:
    # Checking if parameter to check timeout is set True
```
Please wrap all of this in `with capture_internal_exceptions()` and possibly move it to a separate function.
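A sketch of the suggested wrapping; `capture_internal_exceptions` is the SDK's context manager for swallowing errors in SDK-internal code, while `_start_timeout_thread` is a hypothetical helper used only for illustration:

```python
from sentry_sdk.utils import capture_internal_exceptions

with capture_internal_exceptions():
    # A failure while setting up the timeout-warning thread must never
    # break the wrapped Lambda handler; the SDK logs it internally.
    if (
        integration.check_timeout_error
        and configured_time_in_millis > TIMEOUT_THRESHOLD_MILLIS
    ):
        _start_timeout_thread(configured_time_in_millis)  # hypothetical helper
```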
```python
configured_time_in_sec = configured_time_in_millis / SECONDS_CONVERSION_FACTOR
configured_time = int(configured_time_in_sec)
```
Suggested change:

```diff
-configured_time_in_sec = configured_time_in_millis / SECONDS_CONVERSION_FACTOR
-configured_time = int(configured_time_in_sec)
+configured_time_in_sec = int(configured_time_in_millis / SECONDS_CONVERSION_FACTOR)
```
Super minor nitpick, but I think we don't need so many locals here.
```python
# type: (bool) -> None
self.check_timeout_error = check_timeout_error

def get_check_timeout_error(self):
```
Don't need this, just access the attrs directly (same in Thread subclass)
sentry_sdk/utils.py (Outdated)

```python
time.sleep(self.get_timeout_duration())
# Raising Exception after timeout duration is reached
raise Exception(
    "WARNING : Function is expected to get timed out. Configured timeout duration = {} seconds".format(
```
Can we also show `get_remaining_time_in_millis` here?
1. Changed variable names as per review comments for check_timeout_error, TIMEOUT_THRESHOLD_MILLIS, SECONDS_CONVERSION_FACTOR.
2. Removed unnecessary getter methods.
3. Modified the docstring for the TimeoutThread class.
4. Added a new context (new section) for execution data.
5. Moved the logic that generates the timeout warning inside capture_exception with context.
6. Parameterized the subprocess.check_call() method for the initialization error.
7. Created a new exception class, ServerlessTimeoutWarning, raised in case of timeouts.
8. Fixed other minor issues as per review comments.
```python
(exception,) = event["exception"]["values"]
assert exception["type"] == "Exception"
assert exception["value"] == expected_text
log_result = (base64.b64decode(response["LogResult"])).decode("utf-8")
```
I guess I am still confused as to what we're asserting here. It seems we are not expecting any timeout event at all?
@untitaker Here, we are asserting on the exception message that we send when the ServerlessTimeoutWarning custom exception is raised in case of a timeout error. If the exception is raised, it will be in the log result.
Also, the timeout event is still not coming even after the solution you suggested. I've applied it in this manner:
```python
def run(thread_hub, configured_timeout):
    with thread_hub:
        try:
            raise ServerlessTimeoutWarning(
                "WARNING : Function is expected to get timed out. "
                "Configured timeout duration = {} seconds.".format(configured_timeout)
            )
        except Exception:
            client = thread_hub.client
            exc_info = sys.exc_info()
            event, hint = event_from_exception(
                exc_info,
                client_options=client.options,
                mechanism={"type": "threads", "handled": False},
            )
            thread_hub.capture_event(event, hint=hint)
            reraise(*exc_info)
```
And, while starting a new thread:
```python
thread_hub = Hub(Hub.current)
tr = threading.Thread(target=run, args=[thread_hub, configured_timeout])
tr.start()
```
I've even tried with:

```python
thread_hub = Hub.current
```
This does generate the timeout warning, and the mechanism I provided in the above code appears on the Sentry dashboard, but it still doesn't show the required stacktrace, and the event data also doesn't come through in the automated test cases.
`thread_hub = Hub.current` ought to be correct.
I can't really tell what would be wrong, but as-is the testcase is wrong. Y'all have to debug this. If it makes things easier you can also try `hub.capture_message` without any exception.
It's fine if there's no stacktrace. The only useful stacktrace comes from the sigalrm approach we discussed in Slack afaik.
So you mean the event should come regardless of the test case? Maybe it is related to the HttpTransport we are using, and we might need to modify it to send the event in case of a timeout error, because I've tried all the scenarios you mentioned and the event data still does not come for the unit tests.
I mean the test case right here is supposed to assert that there is an event, right? but right now it's not asserting that.
I don't think the transport is the issue tbh
Okay, I'll debug through this issue and see why event data is not coming and post the updates.
@untitaker I've debugged this issue. As I suspected earlier, the events are not coming because of the `_send_event()` method defined inside `LAMBDA_PRELUDE` of test_aws.py; basically, they are not coming because of the 1-second delay that is kept there.
Also, the reason why events are not coming after the 1-second delay is the same reason the timeout warning needs to be generated a little before the actual timeout.
Is it really necessary to keep that delay? Can we remove it?
```python
    isinstance(contexts, dict)
    and "memory usage and execution duration" not in contexts
):
    contexts["memory usage and execution time"] = {
```
I didn't find docs suggesting against spaces in the context key, but in practice we tend not to use it. Lambda runtime information seems like a good fit for the standard `runtime` context (note that it is okay to send more keys than the ones in the docs, the extra keys form a key-value table).
https://develop.sentry.dev/sdk/event-payloads/contexts/#runtime-context
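For example, inside the event processor the Lambda details could be attached to the standard context roughly like this; the extra keys and values shown are illustrative, not the integration's actual fields:

```python
# `contexts` is the event's contexts dict, `aws_context` the Lambda context object.
contexts["runtime"] = {
    "name": "AWS Lambda",
    "version": "python3.8",  # illustrative value
    # Extra keys beyond name/version are allowed; they render as a key-value table.
    "memory_limit_in_mb": aws_context.memory_limit_in_mb,
    "remaining_time_in_millis": aws_context.get_remaining_time_in_millis(),
}
```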
@rhcarvalho As per a suggestion from AJ, I figured that a separate section for execution time and memory usage would be a good implementation. Also, the "Additional data" section, which contains all data related to AWS Lambda, contains spaces in the context key, and considering the visual aspect of the UI, I kept spaces.
So, as a solution, should I keep the execution time and memory usage inside the lambda details or keep the current implementation (I'll share a screenshot of the current dashboard on Slack)?
Also, the "Additional data" section which contains all data related to AWS Lambda contains space in the context key, and considering the visual aspect for UI, I kept spaces.
Let's stay consistent within the integration for now, but just as a side note rodolfo is right that most other integrations don't do this. It also leads to poor (or at least worse) UX when attempting to configure some other features such as server-side data scrubbing.
So, I'll put the Execution time & memory usage data in the "Additional Data" section inside the "lambda" context.
Yes, just put it into `extra['lambda']`, you're right. Sorry about the confusion.
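A sketch of that resolution; the exact field names for duration and memory are assumptions, not the integration's final keys:

```python
# `extra` is the event's extra dict inside the event processor.
extra.setdefault("lambda", {}).update(
    {
        "execution_duration_in_millis": execution_duration,  # assumed variable
        "memory_limit_in_mb": aws_context.memory_limit_in_mb,
    }
)
```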
```
@@ -22,13 +22,16 @@

import json
from sentry_sdk.transport import HttpTransport

FLUSH_EVENT = True
```
@rhcarvalho @untitaker I defined a global constant to parameterize the timeout-error case so that event data can still be captured.
yeah that works for me
Thanks for doing this @shantanu73.
I just have a few comments that could be addressed in a follow up.
```python
    return init_error(*args, **kwargs)

# Fetch Initialization error details from arguments
error = json.loads(args[1])
```
I'd be more defensive here. This will blow up if called with fewer than two positional arguments or if the arg is not valid JSON.
Can be done in a follow up.
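A hedged sketch of a more defensive version:

```python
# Don't assume a second positional argument or valid JSON; fall back to an
# empty dict so the wrapper never crashes before the original callable runs.
error = {}
if len(args) > 1:
    try:
        error = json.loads(args[1])
    except (TypeError, ValueError):
        pass
```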
```diff
-def _make_request_event_processor(aws_event, aws_context):
-    # type: (Any, Any) -> EventProcessor
+def _make_request_event_processor(aws_event, aws_context, configured_timeout):
+    # type: (Any, Any, Any) -> EventProcessor
```
The type here is well-known: `float`.
```python
if FLUSH_EVENT:
    time.sleep(1)
```
Hard to understand for maintenance why `FLUSH_EVENT` toggles calling `time.sleep(1)`. I'd think it would control calling `sentry.flush()` 😕
Might be a matter of naming?