[WIP] Step Functions: Telemetry #12581

Draft · wants to merge 2 commits into base: main

1 change: 1 addition & 0 deletions localstack-core/localstack/runtime/analytics.py
```diff
@@ -85,6 +85,7 @@
     "OUTBOUND_HTTP_PROXY",
     "OUTBOUND_HTTPS_PROXY",
     "S3_DIR",
+    "SFN_MOCK_CONFIG",
     "TMPDIR",
 ]
```
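
For context, SFN_MOCK_CONFIG joins an allowlist of environment variables tracked by the analytics runtime, presumably presence-only since entries like S3_DIR and TMPDIR hold potentially sensitive local paths. A minimal sketch of how such an allowlist is typically consumed is below; the list name and helper are assumptions for illustration, not the actual runtime code (the diff shows only the entries):

```python
import os

# Hypothetical allowlist name; the diff above shows only the entries.
PRESENCE_ENV_VAR_NAMES = [
    "OUTBOUND_HTTP_PROXY",
    "OUTBOUND_HTTPS_PROXY",
    "S3_DIR",
    "SFN_MOCK_CONFIG",
    "TMPDIR",
]


def collect_presence_flags() -> dict[str, bool]:
    # Report only whether each tracked variable is set, never its value.
    return {name: name in os.environ for name in PRESENCE_ENV_VAR_NAMES}
```
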
```diff
@@ -6,6 +6,7 @@
 import time
 from typing import Final, Optional
 
+import localstack.services.stepfunctions.usage as UsageMetrics
 from localstack.aws.api import CommonServiceException, RequestContext
 from localstack.aws.api.stepfunctions import (
     ActivityDoesNotExist,
```

Member: nit: I think the new standard name is analytics.py (following the Notion guide and the samples in API Gateway, Lambda, etc.).

```diff
@@ -789,6 +790,10 @@ def start_execution(
             state_machine_arn_parts[1] if len(state_machine_arn_parts) == 2 else None
         )
 
+        # Count metrics about the execution type.
+        is_mock_test_case: bool = mock_test_case_name is not None
+        UsageMetrics.execution_type_counter.labels(is_mock_test_case=is_mock_test_case).increment()
+
         store = self.get_store(context=context)
 
         alias: Optional[Alias] = store.aliases.get(state_machine_arn)
```

Member: question: Is this the right place to increment the counter, given we could raise multiple exceptions after it (e.g., _raise_state_machine_does_not_exist or InvalidExecutionInput)?

Member: question: Are we missing other entry points, such as start_sync_execution? Are they worth tracking?
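
Regarding the placement question, one possible restructuring is sketched below: increment only once the outcome is known. This is a sketch, not the PR's implementation — the status label is hypothetical (it would have to be added to the counter's labels), and _do_start_execution is a stand-in for the existing validation and execution logic:

```python
# Sketch: count after the outcome is known instead of before validation.
# "status" is a hypothetical extra label; _do_start_execution is a stand-in
# helper, not a real method in this PR.
def start_execution(self, context, state_machine_arn, mock_test_case_name=None, **kwargs):
    is_mock_test_case = mock_test_case_name is not None
    try:
        result = self._do_start_execution(context, state_machine_arn, **kwargs)
    except Exception:
        UsageMetrics.execution_type_counter.labels(
            is_mock_test_case=is_mock_test_case, status="error"
        ).increment()
        raise
    UsageMetrics.execution_type_counter.labels(
        is_mock_test_case=is_mock_test_case, status="success"
    ).increment()
    return result
```
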
5 changes: 5 additions & 0 deletions localstack-core/localstack/services/stepfunctions/usage.py
```diff
@@ -10,3 +10,8 @@
     name="language_features_used",
     labels=["query_language", "uses_variables"],
 )
+
+# Initialize a counter to record the use of each execution type.
+execution_type_counter = Counter(
+    namespace="stepfunctions", name="execution_type", labels=["is_mock_test_case"]
+)
```

Member: naming: Isn't execution sufficient? The _type suffix sounds like an aspect that a label should cover. See "Metric & Label Best Practices" in Notion.

Member: suggestion/idea: Following the Lambda example in localstack-core/localstack/services/lambda_/analytics.py:function_counter, what about the following, focusing on the main operation(s) rather than feature-specific counters (e.g., JSONata, is_mock_test_case, workflow_type, invocation_type, ...):

```python
execution_counter = Counter(
    namespace="stepfunctions",
    name="execution",
    labels=[
        "is_mock_test_case",
        "workflow_type",  # standard | express
        "invocation_type",  # async | sync
    ],
)
```

Is it worthwhile (i.e., can we derive something actionable) to capture more metadata linked to executions?
In Lambda, we even use an operation field to capture create and invoke activities separately. Feel free to weigh in on the best model for StepFunctions here.
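
For illustration, a sketch of how the proposed execution_counter could cover both entry points raised above. The label values are assumptions; one grounded detail is that the AWS StartSyncExecution API only applies to express workflows:

```python
# Sketch: tagging both provider entry points with the proposed labels.
def start_execution(self, context, **kwargs):
    UsageMetrics.execution_counter.labels(
        is_mock_test_case=kwargs.get("mock_test_case_name") is not None,
        workflow_type="standard",  # or "express", depending on the state machine
        invocation_type="async",
    ).increment()
    ...


def start_sync_execution(self, context, **kwargs):
    UsageMetrics.execution_counter.labels(
        is_mock_test_case=False,  # assumes mocking is unsupported on this path
        workflow_type="express",  # StartSyncExecution supports only express workflows
        invocation_type="sync",
    ).increment()
    ...
```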

sidenote: A status dimension might be interesting for learning about unexpected/unhandled errors (i.e., try/catch) as we collect more metadata about executions. However, if all executions happen through the API, unhandled exceptions should already be tracked generically by our ASF-based telemetry (including actionable stack traces in DEBUG mode).

Member: What is your recommendation/guidance from the data side, @vittoriopolverino, especially regarding the trade-off: counters for features/parameters (e.g., JSONata, is_mock_test_case, workflow_type, invocation_type) vs. counters for operations (e.g., execute, create) with features as labels?
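
To make that trade-off concrete, the two models could look roughly like this — a sketch only, with illustrative counter names, assuming the same Counter API already used in usage.py:

```python
# Model A: one counter per feature/parameter.
jsonata_counter = Counter(
    namespace="stepfunctions", name="jsonata_usage", labels=["uses_variables"]
)
mock_counter = Counter(
    namespace="stepfunctions", name="mock_test_case", labels=["is_mock_test_case"]
)

# Model B: one counter per operation, with features captured as labels.
execution_counter = Counter(
    namespace="stepfunctions",
    name="execution",
    labels=["query_language", "is_mock_test_case", "workflow_type", "invocation_type"],
)
```

Model B keeps the metadata in one place and allows cross-feature breakdowns (e.g., mock test cases per workflow type) without joining separate counters.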
