
[WIP] Step Functions: Telemetry #12581


Draft
MEPalma wants to merge 2 commits into master

Conversation

@MEPalma (Contributor) commented May 5, 2025

Motivation

The mocked service integrations feature for Step Functions testing was recently introduced, but no telemetry was added to track its usage. These changes address that gap by recording usage of the associated environment variable, SFN_MOCK_CONFIG, without capturing the file path it points to.
Moreover, these changes introduce a new analytics usage counter to capture the number of executions by type. For now, this covers the is_mock_test_case execution type, which records whether an execution is initiated as part of a mocked service integrations test case. The counter may later be extended with additional labels, for example to distinguish between standard and express executions.

Changes

  • added SFN_MOCK_CONFIG to the analytics PRESENCE_ENV_VAR list
  • added a new execution_type_counter usage counter
  • added counting of is_mock_test_case upon start_execution calls (see the sketch below)
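For context, a minimal sketch of how these pieces could fit together, based on the counter snippets quoted in the review below; the import path and the UsageMetrics/record_execution_type names are assumptions, not code from this PR:

# Sketch only: the Counter API mirrors the snippets quoted in this PR;
# the import path below is an assumption.
from localstack.utils.analytics.metrics import Counter

class UsageMetrics:
    # Counts executions, labeled by whether they run as part of a
    # mocked service integrations test case.
    execution_type_counter = Counter(
        namespace="stepfunctions", name="execution_type", labels=["is_mock_test_case"]
    )

def record_execution_type(mock_test_case_name):
    # Hypothetical helper standing in for the start_execution call site.
    # Only a boolean is reported -- never the test case name or the
    # SFN_MOCK_CONFIG file path it derives from.
    is_mock_test_case = mock_test_case_name is not None
    UsageMetrics.execution_type_counter.labels(
        is_mock_test_case=is_mock_test_case
    ).increment()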

@MEPalma MEPalma added this to the 4.4 milestone May 5, 2025
@MEPalma MEPalma requested a review from joe4dev May 5, 2025 15:02
@MEPalma MEPalma self-assigned this May 5, 2025
@MEPalma MEPalma added the semver: minor label (Non-breaking changes which can be included in minor releases, but not in patch releases) May 5, 2025

github-actions bot commented May 5, 2025

S3 Image Test Results (AMD64 / ARM64)

  2 files  ±0    2 suites  ±0   8m 39s ⏱️ -17s
488 tests ±0  438 ✅ ±0   50 💤 ±0  0 ❌ ±0 
976 runs  ±0  876 ✅ ±0  100 💤 ±0  0 ❌ ±0 

Results for commit a50fcba. ± Comparison against base commit 9b55514.

♻️ This comment has been updated with latest results.

@MEPalma MEPalma changed the title Step Functions: Add Telemetry for SFN_MOCK_CONFIG Usage Step Functions: Add Telemetry for Mocked Integrations May 5, 2025

github-actions bot commented May 5, 2025

LocalStack Community integration with Pro

    2 files  ±0      2 suites  ±0   1h 42m 27s ⏱️ +44s
4 399 tests ±0  4 037 ✅ ±0  362 💤 ±0  0 ❌ ±0 
4 401 runs  ±0  4 037 ✅ ±0  364 💤 ±0  0 ❌ ±0 

Results for commit a50fcba. ± Comparison against base commit 9b55514.

@MEPalma MEPalma marked this pull request as ready for review May 5, 2025 17:40
@MEPalma MEPalma requested review from gregfurman and thrau as code owners May 5, 2025 17:40
@MEPalma MEPalma removed request for thrau and gregfurman May 5, 2025 17:40
@joe4dev (Member) left a comment


I added some questions and discussion items regarding the optimal place of instrumentation and the trade-off between feature/parameter and operation tracking.

@@ -6,6 +6,7 @@
import time

@joe4dev (Member): nit: I think the new standard name is analytics.py (following the Notion guide and samples in API Gateway, Lambda, etc.)

@@ -789,6 +790,10 @@ def start_execution(
         state_machine_arn_parts[1] if len(state_machine_arn_parts) == 2 else None
     )
+
+    # Count metrics about the execution type.
+    is_mock_test_case: bool = mock_test_case_name is not None
+    UsageMetrics.execution_type_counter.labels(is_mock_test_case=is_mock_test_case).increment()

@joe4dev (Member):

question: Is this the right place to increment the counter given we could raise multiple exceptions following this counter (e.g., _raise_state_machine_does_not_exist or InvalidExecutionInput)?
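
One hypothetical way to resolve this would be to count only after validation has succeeded, so that failed StartExecution calls are not recorded; the helper names below are illustrative, not the provider's actual API:

# Illustrative rearrangement: validate first, then count, so calls that
# fail with StateMachineDoesNotExist or InvalidExecutionInput are skipped.
def start_execution(self, context, request):
    state_machine = self._resolve_state_machine(request)  # may raise: state machine does not exist
    execution_input = self._parse_execution_input(request)  # may raise: InvalidExecutionInput

    is_mock_test_case = self._mock_test_case_name(request) is not None  # illustrative helper
    UsageMetrics.execution_type_counter.labels(
        is_mock_test_case=is_mock_test_case
    ).increment()
    # ... proceed to start the execution ...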


# Initialize a counter to record the use of each execution type.
execution_type_counter = Counter(
    namespace="stepfunctions", name="execution_type", labels=["is_mock_test_case"]
)

@joe4dev (Member):

naming: Isn't execution sufficient? The _type suffix sounds like an aspect that a label should cover. See "Metric & Label Best Practices" in Notion.

@joe4dev (Member):

suggestion/idea: Following the Lambda example in localstack-core/localstack/services/lambda_/analytics.py:function_counter, what about the following, focusing on the main operation(s) rather than having feature-specific counters (e.g., JSONata, is_mock_test_case, workflow_type, invocation_type, ...):

execution_counter = Counter(
    namespace="stepfunctions",
    name="execution",
    labels=[
        "is_mock_test_case",
        "workflow_type",  # standard | express
        "invocation_type",  # async | sync
    ],
)

Is it worthwhile (i.e., can we derive something actionable from it) to capture more metadata linked to executions?
In Lambda, we even use an operation field to capture create and invoke activities separately. Feel free to weigh in on what's the best model for Step Functions here.

sidenote: The dimension status might be interesting for learning about unexpected/unhandled errors (i.e., try/catch) as we collect more metadata about executions. However, if all executions happen through the API, unhandled exceptions should be tracked generically by our ASF-based telemetry (including actionable stack traces in DEBUG mode).
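
For illustration, a call site for such a multi-label counter might look as follows; the workflow_type and invocation_type predicates are assumptions, not existing variables:

# Sketch of a single increment covering the proposed labels.
execution_counter.labels(
    is_mock_test_case=mock_test_case_name is not None,
    workflow_type="express" if state_machine_is_express else "standard",
    invocation_type="sync" if is_sync_invocation else "async",
).increment()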

@joe4dev (Member):

What is your recommendation/guidance from the data side @vittoriopolverino, especially regarding the trade-off: counters for features/parameters (e.g., JSONata, is_mock_test_case, workflow_type, invocation_type) vs. for operations (e.g., execute, create) with features as labels?

@@ -789,6 +790,10 @@ def start_execution(
         state_machine_arn_parts[1] if len(state_machine_arn_parts) == 2 else None
     )
+
+    # Count metrics about the execution type.
+    is_mock_test_case: bool = mock_test_case_name is not None
+    UsageMetrics.execution_type_counter.labels(is_mock_test_case=is_mock_test_case).increment()

@joe4dev (Member):

question: Are we missing other entry points, such as start_sync_execution? Are they worth tracking?
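
If start_sync_execution were tracked as well, a sketch reusing the multi-label counter proposed above (the method signature and the is_mock_test_case value are illustrative assumptions) could be:

def start_sync_execution(self, context, request, **kwargs):
    # Hypothetical: count sync executions too, distinguished from async
    # ones via the invocation_type label proposed above.
    UsageMetrics.execution_counter.labels(
        is_mock_test_case=False,  # assumption: mock test cases start via StartExecution only
        invocation_type="sync",
    ).increment()
    # ... existing start_sync_execution logic ...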

@MEPalma MEPalma modified the milestones: 4.4, Playground May 6, 2025
@MEPalma (Contributor, Author) commented May 6, 2025

Thanks @joe4dev for sharing your thoughts!
The objective here was to quickly introduce a basic mechanism to track usage of this new feature in time for the upcoming release. Hence, optimality was not a goal; the aim was to minimize changes ahead of the release. That said, I agree that a more robust tracking strategy is worth exploring when introducing a new counter.
For now, I’m opening a separate PR that only tracks the use of the environment variable, and I’ll continue the broader discussion in this thread.

@MEPalma MEPalma marked this pull request as draft May 6, 2025 09:34
@MEPalma MEPalma changed the title Step Functions: Add Telemetry for Mocked Integrations [WIP] Step Functions: Telemetry May 6, 2025
@MEPalma (Contributor, Author) commented May 6, 2025

Follow-up PR: #12584

Labels
semver: minor (Non-breaking changes which can be included in minor releases, but not in patch releases)

2 participants