Skip to content

Capturing Initialization and Timeout errors for AWS Lambda Integration #756

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Jul 29, 2020
Merged

Capturing Initialization and Timeout errors for AWS Lambda Integration #756

merged 7 commits into from
Jul 29, 2020

Conversation

shantanu73
Copy link
Contributor

Changes:

  1. Added a new wrapper decorator for post_init_error method to capture initialization error for AWS Lambda integration.
  2. Modified _wrap_handler decorator to include code which runs a parallel thread to capture timeout error.
  3. Modified _make_request_event_processor decorator to include execution duration as parameter.
  4. Added TimeoutThread class in utils.py which is useful to capture timeout error.

1) Added a new wrapper decorator for post_init_error method to capture initialization error for AWS Lambda integration.
2) Modified _wrap_handler decorator to include code which runs a parallel thread to capture timeout error.
3) Modified _make_request_event_processor decorator to include execution duration as parameter.
4) Added TimeoutThread class in utils.py which is useful to capture timeout error.
Copy link
Member

@untitaker untitaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm so far, please add tests though

# type: (*Any, **Any) -> Any

# Fetch Initialization error details from arguments
error = json.loads(args[1])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move this down into line 47, no reason to do this work and potentially crash even if the integration is disabled

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I'll make this change.

# environment from arguments and, re-raising it to capture it as an event.
if error.get("errorType"):
exc_info = sys.exc_info()
reraise(*exc_info)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why reraise to catch again instead of calling sys.exc_info once? Does this change the stacktrace somehow?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, this doesn't change the stacktrace.
I got your point. I'll check once and capture the details without reraising.

@@ -126,6 +196,10 @@ def sentry_to_json(*args, **kwargs):

lambda_bootstrap.to_json = sentry_to_json
else:
lambda_bootstrap.LambdaRuntimeClient.post_init_error = _wrap_init_error(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note this will be executed for Python 3.7 only.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, since for python environments 2.7 & 3.6 the runtime already handles these issues. So, I need not handle for those scenarios. In those cases runtime generates FaultException for initialization error.

Shantanu Dhiman added 2 commits July 8, 2020 15:58
1) As per review comments, moved the statement which fetches error details from args of sentry_init_error() method after the integration check.
2) As per review comments, removed the try-except block as execution info was available to us directly.
1) Added automation unit tests for AWS Lambda integration.
2) The test cases are for timeout and initialization errors.

def test_timeout_error(run_lambda_function):
# Modifying LAMBDA_PRELUDE since capturing timeout error is kept optional.
modified_prelude = LAMBDA_PRELUDE.replace("[AwsLambdaIntegration()]", "[AwsLambdaIntegration(True)]")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test could silently do the wrong thing if the string "[AwsLambdaIntegration()]" is not part of LAMBDA_PRELUDE (e.g. after an unrelated code change).

Also using a positional argument is cryptic. One needs to go through hops to figure out what the True value is used for.

My suggestion is to add a new keyword argument to init_sdk defined in LAMBDA_PRELUDE:

def init_sdk(check_timeout_error=False, **extra_init_args):
    # ...
        integrations=[AwsLambdaIntegration(check_timeout_error=check_timeout_error)],

This way, instead of string replacement in the prelude, we can simply append code that uses the new argument.

init_sdk(check_timeout_error=True)

@@ -73,6 +131,14 @@ def _drain_queue():
class AwsLambdaIntegration(Integration):
identifier = "aws_lambda"

def __init__(self, check_timeout_error=False):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check_timeout_error is a rather misleading name because it doesn't cause the integration to "check for a timeout error".

What it does is trigger a warning if you get "close enough" to the timeout.

@@ -25,6 +27,45 @@

F = TypeVar("F", bound=Callable[..., Any])

# Constants
TIMEOUT_THRESHOLD_MILLIS = 1500 # Minimum time required to capture TimeoutError
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sort of makes AwsLambdaIntegration(check_timeout_error=True) "maybe" work as intended, depending on external state that is not obvious when you initialize the integration. Do we really need this?

Comment on lines 91 to 93
if integration.get_check_timeout_error():
# Starting the Timeout thread only if the configured time is greater than Timeout threshold value
if configured_time_in_millis > TIMEOUT_THRESHOLD_MILLIS:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do need this threshold, then let's write a single if condition == less indentation levels.

if COND_A and COND_B:
   ...

Comment on lines 96 to 101
configured_time_in_sec = configured_time_in_millis / SECONDS_CONVERSION_FACTOR
configured_time = int(configured_time_in_sec)

# Setting up the exact integer value of configured time(in seconds)
if configured_time < configured_time_in_sec:
configured_time = configured_time + 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

time.sleep can work with floating point numbers, this dance to round up to the next integer is not really needed, is it?

We could also probably get rid of SECONDS_CONVERSION_FACTOR and simply use subsecond timeout in the tests, perhaps?



class TimeoutThread(threading.Thread):
"""Creates a Thread."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a useful docstring for this class.

Comment on lines 886 to 892
def get_timeout_duration(self):
# type: () -> float
return self.timeout_duration

def get_configured_timeout(self):
# type: () -> int
return self.configured_timeout
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Python it is not idiomatic to have this type of one line getters. Much simpler to access the properties directly.

Comment on lines 897 to 901
# Raising Exception after timeout duration is reached
raise Exception(
"WARNING : Function is expected to get timed out. Configured timeout duration = {} seconds".format(
self.get_configured_timeout())
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could include in the message the function name (if available) and how long the function has run for.

)
expected_text = "WARNING : Function is expected to get timed out. Configured timeout duration = 4 seconds"
if not events:
# In case of Python 2.7 runtime environment
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests should not rely on the truthy value of events and infer it is Python 2.7 or 3.x.
E.g., if something is wrong in Python 3.x and events is the empty list, the test would go into the wrong branch of the if...else.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The runtime used is tricky to access in tests right now. I would change this code:

@pytest.fixture(params=["python3.6", "python3.7", "python3.8", "python2.7"])
def run_lambda_function(request):
    # access to request.param somewhere here

to this:

@pytest.fixture(params=["python3.6", "python3.7", "python3.8", "python2.7"])
def lambda_runtime_version(request):
    return request.param

def run_lambda_function(lambda_runtime_version):
    # access lambda_runtime_version instead

then in the test you can depend on the lambda_runtime_version fixture yourself and inspect it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I am confused, does this mean timeout events are not actually working on python 2.7?

Comment on lines 72 to 74
except Exception as e:
# Exception caught in case of Initialization error
pass
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't look right to silence the exception here. If calling the subprocess fails for some arbitrary reason, the test should fail.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make this behavior configurable via a flag passed to inner. Rodolfo is right that this exception should be re-raised for most tests, except for the test that tests functions failing on initialization.

In fact I would do this:

def inner(..., import_locally=True):
    ...
    if import_locally:
        subprocess.check_call()

The line literally only exists to validate the file before sending it to AWS. In your case you want to send a broken file to AWS intentionally.

Copy link
Member

@untitaker untitaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good poc but there's some work left to do here to clean it up. Mostly I am concerned that timeouts are not working or tested correctly on python 2.7.

# type: () -> None
time.sleep(self.get_timeout_duration())
# Raising Exception after timeout duration is reached
raise Exception(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some points on this:

  • I would prefer a dedicated exception type here, it makes filtering in the UI easier
  • The stacktrace of this thread is less useful than of the main thread. I am not sure if accessing the main thread's stacktrace is possible.


try:
# Checking if parameter to check timeout is set True
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please wrap all of this in with capture_internal_exceptions() and possibly move to separate function

Comment on lines 72 to 74
except Exception as e:
# Exception caught in case of Initialization error
pass
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make this behavior configurable via a flag passed to inner. Rodolfo is right that this exception should be re-raised for most tests, except for the test that tests functions failing on initialization.

In fact I would do this:

def inner(..., import_locally=True):
    ...
    if import_locally:
        subprocess.check_call()

The line literally only exists to validate the file before sending it to AWS. In your case you want to send a broken file to AWS intentionally.

)
expected_text = "WARNING : Function is expected to get timed out. Configured timeout duration = 4 seconds"
if not events:
# In case of Python 2.7 runtime environment
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The runtime used is tricky to access in tests right now. I would change this code:

@pytest.fixture(params=["python3.6", "python3.7", "python3.8", "python2.7"])
def run_lambda_function(request):
    # access to request.param somewhere here

to this:

@pytest.fixture(params=["python3.6", "python3.7", "python3.8", "python2.7"])
def lambda_runtime_version(request):
    return request.param

def run_lambda_function(lambda_runtime_version):
    # access lambda_runtime_version instead

then in the test you can depend on the lambda_runtime_version fixture yourself and inspect it.

)
expected_text = "WARNING : Function is expected to get timed out. Configured timeout duration = 4 seconds"
if not events:
# In case of Python 2.7 runtime environment
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I am confused, does this mean timeout events are not actually working on python 2.7?

Comment on lines 96 to 97
configured_time_in_sec = configured_time_in_millis / SECONDS_CONVERSION_FACTOR
configured_time = int(configured_time_in_sec)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
configured_time_in_sec = configured_time_in_millis / SECONDS_CONVERSION_FACTOR
configured_time = int(configured_time_in_sec)
configured_time_in_sec = int(configured_time_in_millis / SECONDS_CONVERSION_FACTOR)

super minor nitpick but i think we don't need so many locals here

# type: (bool) -> None
self.check_timeout_error = check_timeout_error

def get_check_timeout_error(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't need this, just access the attrs directly (same in Thread subclass)

time.sleep(self.get_timeout_duration())
# Raising Exception after timeout duration is reached
raise Exception(
"WARNING : Function is expected to get timed out. Configured timeout duration = {} seconds".format(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also show get_remaining_time_in_millis here?

    1) Changed variable names as per review comments for check_timeout_error, TIMEOUT_THRESHOLD_MILLIS,
       SECONDS_CONVERSION_FACTOR.
    2) Removed unnecessary getter methods.
    3) Modified docstring for TimeoutThread class.
    4) Added new context (new section) for execution data.
    5) Moved logic to generate timeout warning inside capture_exception with context.
    6) Parameterized subprocess.check_call() method for initialization error.
    7) Created a new exception class ServerlessTimeoutWarning raised for case of timeouts.
    8) Fixed other minor issues as per review comments.
(exception,) = event["exception"]["values"]
assert exception["type"] == "Exception"
assert exception["value"] == expected_text
log_result = (base64.b64decode(response["LogResult"])).decode("utf-8")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I am still confused as to what we're asserting here. It seems we are not expecting any timeout event at all?

Copy link
Contributor Author

@shantanu73 shantanu73 Jul 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@untitaker
Here, we are asserting with the exception message that we are sending when the ServerlessTimeoutWarning custom exception is raise in case of timeout error. If the exception is raised it'll be there in the log result.
Also, timeout event is not coming even after the solution that you suggested. I've applied it is this manner:-

def run(thread_hub, configured_timeout):
with thread_hub:
try:
raise ServerlessTimeoutWarning("WARNING : Function is expected to get timed out. Configured timeout duration = {} seconds.".format(configured_timeout))
except Exception:
client = thread_hub.client
exc_info = sys.exc_info()
event, hint = event_from_exception( exc_info, client_options=client.options, mechanism={"type": "threads", "handled": False}, )
thread_hub.capture_event(event, hint=hint)
reraise(*exc_info)

And, while starting a new thread:

thread_hub = Hub(Hub.current)
tr = threading.Thread(target=run, args=[thread_hub, configured_timeout])
tr.start()

I've even tried with :-

thread_hub = Hub.current

This does generate timeout warning and the mechanism I provided in above code appears on Sentry dashboard, but it still doesn't show the stacktrace which was required and neither does the events data come in case f the automation test cases.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thread_hub = Hub.current ought to be correct.

I can't really tell what would be wrong but as-is the testcase is wrong. Y'all have to debug this. If it makes things easier you can also try hub.capture_message without any exception.

It's fine if there's no stacktrace. The only useful stacktrace comes from the sigalrm approach we discussed in Slack afaik.

Copy link
Contributor Author

@shantanu73 shantanu73 Jul 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you mean that event should come irrespective of whatever be the test case? Maybe, it is related to the HttpTransport we are using and we might need to modify it to send event in case of timeout error. Because I've tried with all the scenrios you mentioned and still for the unit tests event data does not come.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean the test case right here is supposed to assert that there is an event, right? but right now it's not asserting that.

I don't think the transport is the issue tbh

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I'll debug through this issue and see why event data is not coming and post the updates.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@untitaker I've debugged this issue. As I suspected earlier as well, the events are not coming because of the _send_event() method defined inside LAMBDA_PRELUDE of test_aws.py. The events =basically are not coming because of the delay of 1 second that is kept there.
Also, the reason why events are not coming after 1 second delay is because of the same reason timeout warning needs to be generated a litter before the actual timeout.

Is it very necessary to keep that delay ? can we remove that delay ?

isinstance(contexts, dict)
and "memory usage and execution duration" not in contexts
):
contexts["memory usage and execution time"] = {
Copy link
Contributor

@rhcarvalho rhcarvalho Jul 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't find docs suggesting against spaces in the context key, but in practice we tend not to use it.

Lambda runtime information seems like a good fit for the standard runtime context (note that it is okay to send more keys than the ones in the docs, the extra keys form a key-value table).
https://develop.sentry.dev/sdk/event-payloads/contexts/#runtime-context

Copy link
Contributor Author

@shantanu73 shantanu73 Jul 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rhcarvalho
As per a suggestion given by AJ, I figured that a separate section for execution time and memory usage should be a good implementation. Also, the "Additional data" section which contains all data related to AWS Lambda contains space in the context key, and considering the visual aspect for UI, I kept spaces.
So, as a solution, should I keep the execution time and memory usage inside lambda details or keep the current implementation (I'll share the screenshot of current dashboard on slack)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the "Additional data" section which contains all data related to AWS Lambda contains space in the context key, and considering the visual aspect for UI, I kept spaces.

Let's stay consistent within the integration for now, but just as a side note rodolfo is right that most other integrations don't do this. It also leads to poor (or at least worse) UX when attempting to configure some other features such as server-side data scrubbing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I'll put the Execution time & memory usage data in the "Additional Data" section inside the "lambda" context.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, just put it into extra['lambda'], you're right. Sorry about the confusion

1) Removed the memory usage data and reverted back the execution time and remaining time data in the 'lambda' key of 'addition data' context.
2) Paramterized the time.sleep() method in _send_transport() method of LAMBDA_PRELUDE to get the event data for timeout error test case.
@@ -22,13 +22,16 @@
import json
from sentry_sdk.transport import HttpTransport

FLUSH_EVENT = True
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rhcarvalho @untitaker defined a global constant to parameterize timeout error to capture events data.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah that works for me

@rhcarvalho rhcarvalho self-requested a review July 29, 2020 09:46
Copy link
Contributor

@rhcarvalho rhcarvalho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this @shantanu73.
I just have a few comments that could be addressed in a follow up.

return init_error(*args, **kwargs)

# Fetch Initialization error details from arguments
error = json.loads(args[1])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be more defensive here. This will blow up if called with less than two positional arguments or if the arg is not valid JSON.

Can be done in a follow up.

def _make_request_event_processor(aws_event, aws_context):
# type: (Any, Any) -> EventProcessor
def _make_request_event_processor(aws_event, aws_context, configured_timeout):
# type: (Any, Any, Any) -> EventProcessor
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The type here is well-known, float.

Comment on lines +33 to +34
if FLUSH_EVENT:
time.sleep(1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hard to understand for maintenance why FLUSH_EVENT toggles calling time.sleep(1).

I'd think it would control calling sentry.flush() 😕

Might be a matter of naming?

@untitaker untitaker merged commit f7c494b into getsentry:master Jul 29, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants