[WIP] Step Functions: Telemetry #12581
Conversation
I added some questions and discussion items regarding the optimal place of instrumentation and the trade-off regarding feature/parameter vs. operation tracking.
@@ -6,6 +6,7 @@
import time
nit: I think the new standard name is analytics.py (following the Notion guide and the samples in API Gateway, Lambda, etc.).
@@ -789,6 +790,10 @@ def start_execution(
    state_machine_arn_parts[1] if len(state_machine_arn_parts) == 2 else None
)

# Count metrics about the execution type.
is_mock_test_case: bool = mock_test_case_name is not None
UsageMetrics.execution_type_counter.labels(is_mock_test_case=is_mock_test_case).increment()
question: Is this the right place to increment the counter, given that multiple exceptions can still be raised after it (e.g., _raise_state_machine_does_not_exist or InvalidExecutionInput)?
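For illustration, a minimal sketch of the alternative placement this question implies, where the counter is incremented only after validation has passed; everything beyond the names quoted in this thread is an assumption:

# Sketch only: the validation helpers and their signatures are assumed from this thread.
self._raise_state_machine_does_not_exist(state_machine_arn)  # may raise
if execution_input_is_invalid:  # hypothetical check that would raise InvalidExecutionInput
    raise InvalidExecutionInput()

# Increment only once the request is known to be valid.
is_mock_test_case: bool = mock_test_case_name is not None
UsageMetrics.execution_type_counter.labels(is_mock_test_case=is_mock_test_case).increment()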
# Initialize a counter to record the use of each execution type.
execution_type_counter = Counter(
    namespace="stepfunctions", name="execution_type", labels=["is_mock_test_case"]
)
naming: Isn't execution sufficient? The _type suffix sounds like an aspect that a label should cover. See "Metric & Label Best Practices" in Notion.
suggestion/idea: Following the Lambda example in localstack-core/localstack/services/lambda_/analytics.py:function_counter, what about the following, focusing on the main operation(s) rather than having feature-specific counters (e.g., JSONata, is_mock_test_case, workflow_type, invocation_type, ...):

execution_counter = Counter(
    namespace="stepfunctions",
    name="execution",
    labels=[
        "is_mock_test_case",
        "workflow_type",  # standard | express
        "invocation_type",  # async | sync
    ],
)
Is it worthwhile (i.e., can we derive something actionable) to capture more metadata linked to executions? In Lambda, we even use an operation field to capture create and invoke activities separately. Feel free to weigh in on what's the best model for Step Functions here.
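For illustration, a hedged sketch of the operation-oriented model if Step Functions mirrored the Lambda pattern; the operation values and the call site are assumptions, not an agreed design:

from localstack.utils.analytics.metrics import Counter  # import path assumed

execution_counter = Counter(
    namespace="stepfunctions",
    name="execution",
    labels=[
        "operation",  # e.g., create | execute (modeled on Lambda; values assumed)
        "is_mock_test_case",
        "workflow_type",  # standard | express
        "invocation_type",  # async | sync
    ],
)

# Hypothetical call site in start_execution:
execution_counter.labels(
    operation="execute",
    is_mock_test_case=mock_test_case_name is not None,
    workflow_type="standard",
    invocation_type="async",
).increment()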
sidenote: A status dimension might be interesting for learning about unexpected/unhandled errors (i.e., via try/catch) as we collect more metadata about executions. However, if all executions happen through the API, unhandled exceptions should already be tracked generically by our ASF-based telemetry (including actionable stack traces in DEBUG mode).
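For illustration, a hedged sketch of how such a status label could be recorded around the execution entry point; the counter definition and the surrounding code are assumptions:

# Sketch only: assumes a counter defined with a "status" label in addition
# to the labels suggested above.
status = "success"
try:
    execution.start()  # hypothetical call into the execution machinery
except Exception:
    status = "unhandled_error"
    raise
finally:
    execution_counter.labels(
        status=status,
        is_mock_test_case=is_mock_test_case,
        workflow_type="standard",
        invocation_type="async",
    ).increment()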
What is your recommendation/guidance from the data side @vittoriopolverino, especially regarding the trade-off: counters for features/parameters (e.g., JSONata, is_mock_test_case, workflow_type, invocation_type) vs. for operations (e.g., execute, create) with features as labels?
@@ -789,6 +790,10 @@ def start_execution(
    state_machine_arn_parts[1] if len(state_machine_arn_parts) == 2 else None
)

# Count metrics about the execution type.
is_mock_test_case: bool = mock_test_case_name is not None
UsageMetrics.execution_type_counter.labels(is_mock_test_case=is_mock_test_case).increment()
question: Are we missing other entry points, such as start_sync_execution? Are they worth tracking?
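For illustration, a minimal sketch of applying the same pattern inside start_sync_execution, reusing the counter from this diff; the method body shown here is an assumption:

def start_sync_execution(self, context, request):
    ...  # existing validation and setup (elided)
    # Same pattern as in start_execution; assuming sync executions have no
    # mock test case, the label is always False here.
    UsageMetrics.execution_type_counter.labels(is_mock_test_case=False).increment()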
Thanks @joe4dev for sharing your thoughts!
Motivation
The mocked service integrations feature for Step Functions testing was recently introduced, but no telemetry was added to track its usage. These changes address that gap by recording usage of the associated environment variable, SFN_MOCK_CONFIG, without capturing the file path it points to.

Moreover, these changes introduce a new analytics usage counter to capture the number of executions by type. For now, this includes the is_mock_test_case execution type, which counts whether an execution is initiated as part of a mocked service integrations test case. This counter may later be extended with additional labels to distinguish between standard and express executions.

Changes
- Added SFN_MOCK_CONFIG to the analytics PRESENCE_ENV_VAR list
- Introduced the execution_type_counter usage counter, which records is_mock_test_case upon start_execution calls