Capturing Initialization and Timeout errors for AWS Lambda Integration #756
Conversation
1. Added a new wrapper decorator for the post_init_error method to capture initialization errors for the AWS Lambda integration.
2. Modified the _wrap_handler decorator to run a parallel thread that captures timeout errors.
3. Modified the _make_request_event_processor decorator to take the execution duration as a parameter.
4. Added a TimeoutThread class in utils.py, used to capture timeout errors.
lgtm so far, please add tests though
```python
# type: (*Any, **Any) -> Any

# Fetch Initialization error details from arguments
error = json.loads(args[1])
```
Can we move this down into line 47, no reason to do this work and potentially crash even if the integration is disabled
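A minimal sketch of that reordering, assuming the names from the integration module (`init_error`, `Hub`, `AwsLambdaIntegration`); the body is illustrative, not the final code:

```python
import json

from sentry_sdk import Hub
from sentry_sdk.integrations.aws_lambda import AwsLambdaIntegration


def _wrap_init_error(init_error):
    def sentry_init_error(*args, **kwargs):
        hub = Hub.current
        integration = hub.get_integration(AwsLambdaIntegration)
        if integration is None:
            # Bail out first: a disabled integration should never pay the
            # JSON-parsing cost or risk crashing on unexpected arguments.
            return init_error(*args, **kwargs)

        # Only now fetch the initialization error details from the arguments.
        error = json.loads(args[1])
        # ... capture `error` as an event here ...
        return init_error(*args, **kwargs)

    return sentry_init_error
```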
Sure, I'll make this change.
```python
# environment from arguments and, re-raising it to capture it as an event.
if error.get("errorType"):
    exc_info = sys.exc_info()
    reraise(*exc_info)
```
Why reraise to catch again instead of calling `sys.exc_info` once? Does this change the stacktrace somehow?
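A hedged sketch of capturing the active exception info once instead of re-raising it, using the SDK's `event_from_exception` helper; how this is wired into the wrapper is an assumption:

```python
import sys

from sentry_sdk.utils import event_from_exception


def _capture_init_error(hub):
    client = hub.client
    if client is None:
        return

    exc_info = sys.exc_info()
    if exc_info and all(exc_info):
        # The error is already being handled, so capture it directly
        # rather than re-raising just to catch it again.
        event, hint = event_from_exception(
            exc_info,
            client_options=client.options,
            mechanism={"type": "aws_lambda", "handled": False},
        )
        hub.capture_event(event, hint=hint)
```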
Nope, this doesn't change the stacktrace.
I got your point. I'll check once and capture the details without reraising.
```
@@ -126,6 +196,10 @@ def sentry_to_json(*args, **kwargs):

        lambda_bootstrap.to_json = sentry_to_json
    else:
        lambda_bootstrap.LambdaRuntimeClient.post_init_error = _wrap_init_error(
```
Note this will be executed for Python 3.7 only.
Yes, since for the Python 2.7 and 3.6 environments the runtime already handles these issues, I don't need to handle those scenarios. In those cases the runtime generates a FaultException for initialization errors.
```python
def test_timeout_error(run_lambda_function):
    # Modifying LAMBDA_PRELUDE since capturing timeout error is kept optional.
    modified_prelude = LAMBDA_PRELUDE.replace(
        "[AwsLambdaIntegration()]", "[AwsLambdaIntegration(True)]"
    )
```
This test could silently do the wrong thing if the string `"[AwsLambdaIntegration()]"` is not part of `LAMBDA_PRELUDE` (e.g. after an unrelated code change). Also, using a positional argument is cryptic: one needs to jump through hoops to figure out what the `True` value is used for.
My suggestion is to add a new keyword argument to `init_sdk` defined in `LAMBDA_PRELUDE`:

```python
def init_sdk(check_timeout_error=False, **extra_init_args):
    # ...
    integrations=[AwsLambdaIntegration(check_timeout_error=check_timeout_error)],
```

This way, instead of string replacement in the prelude, we can simply append code that uses the new argument:

```python
init_sdk(check_timeout_error=True)
```
```
@@ -73,6 +131,14 @@ def _drain_queue():

class AwsLambdaIntegration(Integration):
    identifier = "aws_lambda"

    def __init__(self, check_timeout_error=False):
```
`check_timeout_error` is a rather misleading name because it doesn't cause the integration to "check for a timeout error". What it does is trigger a warning if you get "close enough" to the timeout.
```
@@ -25,6 +27,45 @@

F = TypeVar("F", bound=Callable[..., Any])


# Constants
TIMEOUT_THRESHOLD_MILLIS = 1500  # Minimum time required to capture TimeoutError
```
This sort of makes `AwsLambdaIntegration(check_timeout_error=True)` "maybe" work as intended, depending on external state that is not obvious when you initialize the integration. Do we really need this?
```python
if integration.get_check_timeout_error():
    # Starting the Timeout thread only if the configured time is greater than Timeout threshold value
    if configured_time_in_millis > TIMEOUT_THRESHOLD_MILLIS:
```
If we do need this threshold, then let's write a single `if` condition, which means fewer indentation levels:

```python
if COND_A and COND_B:
    ...
```
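Concretely, using the names from the diff above (a sketch, with the `TimeoutThread` argument shown only for illustration):

```python
# One combined condition: the feature must be enabled and the Lambda's
# configured timeout must leave enough headroom for an early warning.
if (
    integration.check_timeout_error
    and configured_time_in_millis > TIMEOUT_THRESHOLD_MILLIS
):
    timeout_thread = TimeoutThread(configured_time_in_millis / 1000.0)
    timeout_thread.start()
```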
```python
configured_time_in_sec = configured_time_in_millis / SECONDS_CONVERSION_FACTOR
configured_time = int(configured_time_in_sec)

# Setting up the exact integer value of configured time(in seconds)
if configured_time < configured_time_in_sec:
    configured_time = configured_time + 1
```
`time.sleep` can work with floating point numbers, so this dance to round up to the next integer is not really needed, is it? We could also probably get rid of `SECONDS_CONVERSION_FACTOR` and simply use a subsecond timeout in the tests, perhaps?
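A sketch of what that simplification could look like, keeping the variable names from the diff above:

```python
import time

# time.sleep accepts floats, so there is no need to round the
# millisecond-based value up to the next whole second.
waiting_time_in_sec = configured_time_in_millis / 1000.0
time.sleep(waiting_time_in_sec)
```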
sentry_sdk/utils.py (Outdated)

```python
class TimeoutThread(threading.Thread):
    """Creates a Thread."""
```
This is not a useful docstring for this class.
sentry_sdk/utils.py (Outdated)

```python
def get_timeout_duration(self):
    # type: () -> float
    return self.timeout_duration

def get_configured_timeout(self):
    # type: () -> int
    return self.configured_timeout
```
In Python it is not idiomatic to have this type of one line getters. Much simpler to access the properties directly.
sentry_sdk/utils.py (Outdated)

```python
# Raising Exception after timeout duration is reached
raise Exception(
    "WARNING : Function is expected to get timed out. Configured timeout duration = {} seconds".format(
        self.get_configured_timeout()
    )
)
```
We could include in the message the function name (if available) and how long the function has run for.
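For instance, something along these lines; `function_name` and `elapsed_seconds` would have to be passed into the thread, which the current code does not do, so treat them as placeholders:

```python
# A sketch of a more informative message.
raise Exception(
    "WARNING : Function {} is expected to get timed out. It has been "
    "running for {:.1f} of its {} configured seconds.".format(
        function_name, elapsed_seconds, self.configured_timeout
    )
)
```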
```python
)
expected_text = "WARNING : Function is expected to get timed out. Configured timeout duration = 4 seconds"
if not events:
    # In case of Python 2.7 runtime environment
```
Tests should not rely on the truthy value of `events` and infer it is Python 2.7 or 3.x. E.g., if something is wrong in Python 3.x and `events` is the empty list, the test would go into the wrong branch of the `if...else`.
The runtime used is tricky to access in tests right now. I would change this code:

```python
@pytest.fixture(params=["python3.6", "python3.7", "python3.8", "python2.7"])
def run_lambda_function(request):
    # access to request.param somewhere here
```

to this:

```python
@pytest.fixture(params=["python3.6", "python3.7", "python3.8", "python2.7"])
def lambda_runtime_version(request):
    return request.param


def run_lambda_function(lambda_runtime_version):
    # access lambda_runtime_version instead
```

Then in the test you can depend on the `lambda_runtime_version` fixture yourself and inspect it.
Also I am confused, does this mean timeout events are not actually working on python 2.7?
```python
except Exception as e:
    # Exception caught in case of Initialization error
    pass
```
It doesn't look right to silence the exception here. If calling the subprocess fails for some arbitrary reason, the test should fail.
Please make this behavior configurable via a flag passed to `inner`. Rodolfo is right that this exception should be re-raised for most tests, except for the test that tests functions failing on initialization.

In fact I would do this:

```python
def inner(..., import_locally=True):
    ...
    if import_locally:
        subprocess.check_call()
```

The line literally only exists to validate the file before sending it to AWS. In your case you want to send a broken file to AWS intentionally.
This is a good poc but there's some work left to do here to clean it up. Mostly I am concerned that timeouts are not working or tested correctly on python 2.7.
sentry_sdk/utils.py (Outdated)

```python
# type: () -> None
time.sleep(self.get_timeout_duration())
# Raising Exception after timeout duration is reached
raise Exception(
```
Some points on this:
- I would prefer a dedicated exception type here, it makes filtering in the UI easier (see the sketch after this list)
- The stacktrace of this thread is less useful than of the main thread. I am not sure if accessing the main thread's stacktrace is possible.
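A minimal sketch of the dedicated exception type mentioned in the first point; later in this thread a `ServerlessTimeoutWarning` along these lines was adopted:

```python
class ServerlessTimeoutWarning(Exception):
    """Raised when a serverless function is close to hitting its execution timeout."""


# Inside TimeoutThread.run(), raise the dedicated type instead of a bare Exception:
#     raise ServerlessTimeoutWarning(
#         "WARNING : Function is expected to get timed out. "
#         "Configured timeout duration = {} seconds.".format(self.configured_timeout)
#     )
```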
```python
try:
    # Checking if parameter to check timeout is set True
```
Please wrap all of this in `with capture_internal_exceptions()` and possibly move it to a separate function.
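A sketch of the suggested wrapping; `capture_internal_exceptions` is the SDK's context manager for swallowing errors in SDK-internal code, while `_start_timeout_thread` is a hypothetical helper used only for illustration:

```python
from sentry_sdk.utils import capture_internal_exceptions

with capture_internal_exceptions():
    # A failure while setting up the timeout-warning thread must never
    # break the wrapped Lambda handler; the SDK logs it internally.
    if (
        integration.check_timeout_error
        and configured_time_in_millis > TIMEOUT_THRESHOLD_MILLIS
    ):
        _start_timeout_thread(configured_time_in_millis)  # hypothetical helper
```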
```python
configured_time_in_sec = configured_time_in_millis / SECONDS_CONVERSION_FACTOR
configured_time = int(configured_time_in_sec)
```
Suggested change:

```diff
-configured_time_in_sec = configured_time_in_millis / SECONDS_CONVERSION_FACTOR
-configured_time = int(configured_time_in_sec)
+configured_time_in_sec = int(configured_time_in_millis / SECONDS_CONVERSION_FACTOR)
```
Super minor nitpick, but I think we don't need so many locals here.
```python
# type: (bool) -> None
self.check_timeout_error = check_timeout_error

def get_check_timeout_error(self):
```
Don't need this, just access the attrs directly (same in Thread subclass)
sentry_sdk/utils.py (Outdated)

```python
time.sleep(self.get_timeout_duration())
# Raising Exception after timeout duration is reached
raise Exception(
    "WARNING : Function is expected to get timed out. Configured timeout duration = {} seconds".format(
```
Can we also show `get_remaining_time_in_millis` here?
1. Changed variable names as per review comments for check_timeout_error, TIMEOUT_THRESHOLD_MILLIS, SECONDS_CONVERSION_FACTOR.
2. Removed unnecessary getter methods.
3. Modified the docstring for the TimeoutThread class.
4. Added a new context (new section) for execution data.
5. Moved the logic that generates the timeout warning inside capture_exception with context.
6. Parameterized the subprocess.check_call() method for the initialization error.
7. Created a new exception class, ServerlessTimeoutWarning, raised in case of timeouts.
8. Fixed other minor issues as per review comments.
```python
(exception,) = event["exception"]["values"]
assert exception["type"] == "Exception"
assert exception["value"] == expected_text
log_result = (base64.b64decode(response["LogResult"])).decode("utf-8")
```
I guess I am still confused as to what we're asserting here. It seems we are not expecting any timeout event at all?
@untitaker Here, we are asserting on the exception message that we send when the ServerlessTimeoutWarning custom exception is raised in case of a timeout error. If the exception is raised, it will be in the log result.
Also, the timeout event is still not coming even after the solution you suggested. I've applied it in this manner:
```python
def run(thread_hub, configured_timeout):
    with thread_hub:
        try:
            raise ServerlessTimeoutWarning(
                "WARNING : Function is expected to get timed out. "
                "Configured timeout duration = {} seconds.".format(configured_timeout)
            )
        except Exception:
            client = thread_hub.client
            exc_info = sys.exc_info()
            event, hint = event_from_exception(
                exc_info,
                client_options=client.options,
                mechanism={"type": "threads", "handled": False},
            )
            thread_hub.capture_event(event, hint=hint)
            reraise(*exc_info)
```
And, while starting a new thread:
```python
thread_hub = Hub(Hub.current)
tr = threading.Thread(target=run, args=[thread_hub, configured_timeout])
tr.start()
```
I've even tried with:

```python
thread_hub = Hub.current
```
This does generate the timeout warning, and the mechanism I provided in the above code appears on the Sentry dashboard, but it still doesn't show the required stacktrace, and the event data also doesn't come through in the automated test cases.
`thread_hub = Hub.current` ought to be correct.
I can't really tell what would be wrong, but as-is the testcase is wrong. Y'all have to debug this. If it makes things easier you can also try `hub.capture_message` without any exception.
It's fine if there's no stacktrace. The only useful stacktrace comes from the sigalrm approach we discussed in Slack afaik.
So you mean the event should come regardless of the test case? Maybe it is related to the HttpTransport we are using, and we might need to modify it to send the event in case of a timeout error, because I've tried all the scenarios you mentioned and the event data still does not come for the unit tests.
I mean the test case right here is supposed to assert that there is an event, right? but right now it's not asserting that.
I don't think the transport is the issue tbh
Okay, I'll debug through this issue and see why event data is not coming and post the updates.
@untitaker I've debugged this issue. As I suspected earlier, the events are not coming because of the `_send_event()` method defined inside `LAMBDA_PRELUDE` of test_aws.py; basically, they are not coming because of the 1-second delay that is kept there.
Also, the reason why events are not coming after the 1-second delay is the same reason the timeout warning needs to be generated a little before the actual timeout.
Is it really necessary to keep that delay? Can we remove it?
```python
    isinstance(contexts, dict)
    and "memory usage and execution duration" not in contexts
):
    contexts["memory usage and execution time"] = {
```
I didn't find docs suggesting against spaces in the context key, but in practice we tend not to use it. Lambda runtime information seems like a good fit for the standard `runtime` context (note that it is okay to send more keys than the ones in the docs, the extra keys form a key-value table).
https://develop.sentry.dev/sdk/event-payloads/contexts/#runtime-context
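For example, inside the event processor the Lambda details could be attached to the standard context roughly like this; the extra keys and values shown are illustrative, not the integration's actual fields:

```python
# `contexts` is the event's contexts dict, `aws_context` the Lambda context object.
contexts["runtime"] = {
    "name": "AWS Lambda",
    "version": "python3.8",  # illustrative value
    # Extra keys beyond name/version are allowed; they render as a key-value table.
    "memory_limit_in_mb": aws_context.memory_limit_in_mb,
    "remaining_time_in_millis": aws_context.get_remaining_time_in_millis(),
}
```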
@rhcarvalho As per a suggestion from AJ, I figured that a separate section for execution time and memory usage would be a good implementation. Also, the "Additional data" section, which contains all data related to AWS Lambda, contains spaces in the context key, and considering the visual aspect of the UI, I kept spaces.
So, as a solution, should I keep the execution time and memory usage inside the lambda details or keep the current implementation (I'll share a screenshot of the current dashboard on Slack)?
Also, the "Additional data" section which contains all data related to AWS Lambda contains space in the context key, and considering the visual aspect for UI, I kept spaces.
Let's stay consistent within the integration for now, but just as a side note rodolfo is right that most other integrations don't do this. It also leads to poor (or at least worse) UX when attempting to configure some other features such as server-side data scrubbing.
So, I'll put the Execution time & memory usage data in the "Additional Data" section inside the "lambda" context.
Yes, just put it into `extra['lambda']`, you're right. Sorry about the confusion.
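A sketch of that resolution; the exact field names for duration and memory are assumptions, not the integration's final keys:

```python
# `extra` is the event's extra dict inside the event processor.
extra.setdefault("lambda", {}).update(
    {
        "execution_duration_in_millis": execution_duration,  # assumed variable
        "memory_limit_in_mb": aws_context.memory_limit_in_mb,
    }
)
```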
```
@@ -22,13 +22,16 @@

import json
from sentry_sdk.transport import HttpTransport

FLUSH_EVENT = True
```
@rhcarvalho @untitaker I defined a global constant to parameterize the timeout-error case so that event data can still be captured.
yeah that works for me
Thanks for doing this @shantanu73.
I just have a few comments that could be addressed in a follow up.
```python
    return init_error(*args, **kwargs)

# Fetch Initialization error details from arguments
error = json.loads(args[1])
```
I'd be more defensive here. This will blow up if called with fewer than two positional arguments or if the arg is not valid JSON.
Can be done in a follow up.
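A hedged sketch of a more defensive version:

```python
# Don't assume a second positional argument or valid JSON; fall back to an
# empty dict so the wrapper never crashes before the original callable runs.
error = {}
if len(args) > 1:
    try:
        error = json.loads(args[1])
    except (TypeError, ValueError):
        pass
```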
```diff
-def _make_request_event_processor(aws_event, aws_context):
-    # type: (Any, Any) -> EventProcessor
+def _make_request_event_processor(aws_event, aws_context, configured_timeout):
+    # type: (Any, Any, Any) -> EventProcessor
```
The type here is well-known: `float`.
```python
if FLUSH_EVENT:
    time.sleep(1)
```
Hard to understand for maintenance why `FLUSH_EVENT` toggles calling `time.sleep(1)`. I'd think it would control calling `sentry.flush()` 😕
Might be a matter of naming?