Migrate legacy usage counters to new analytic counters #12412

joe4dev · 2025-03-19T13:28:52Z

Motivation

Migrate Lambda and Lambda Event Source Mapping counters to our new standard.

Changes

Remove legacy counters
Add hotreload counter for tracking how many Lambda functions are created with hot reloading enabled
Add function counter for tracking Lambda functions by:
- operation: invoke
- runtime: AWS Lambda runtime such as python3.12 or n/a if package_type is Image
- invocation_type: Event | RequestResponse
- status: trying out a typed definition for distinguishing different error types
- package_type: Zip | Image
Add esm counter for tracking Lambda Event Source Mappings by:
- status: success | partial_batch_failure_error | batch_failure_error | unhandled_error
- source: e.g., aws:sqs

Testing

hotreload: aws.services.lambda_.test_lambda_developer_tools.TestHotReloading.test_hot_reloading
- Manuel: Create a Lambda function with hot reloading enabled: https://docs.localstack.cloud/user-guide/lambda-tools/hot-reloading/#hot-reloading-for-python-lambdas
function:
- invocation_type:
  - aws.services.lambda_.test_lambda.TestLambdaFeatures.test_invocation_type_request_response
  - aws.services.lambda_.test_lambda.TestLambdaFeatures.test_invocation_type_event
- runtime: aws.services.lambda_.test_lambda_common.TestLambdaRuntimesCommon.test_echo_invoke
- status: many scenarios in test_lambda.py
esm: tests/aws/services/lambda_/event_source_mapping

Questions

a) hotreload>operation: Should we use the operations label (e.g., create) if we envison that at some point we might want to track other operations (e.g., reload). Unfortunately, that's a bit tricky in this case because we would need to do that in our custom Lambda RIE Golang codebase.
b) function>operation: Should we add an operations label (e.g., invoke) if we envison to track create as well? If so, how to treat the invocation type label, which only applies for the invoke operation?
c) function>status: What do you think about using typed definitions for such a status label? Are we too fine-grained here?

github-actions · 2025-03-19T14:28:43Z

LocalStack Community integration with Pro

2 files ± 0 2 suites ±0 1h 30m 8s ⏱️ - 25m 23s
3 143 tests - 1 162 2 924 ✅ - 1 060 219 💤 - 102 0 ❌ ±0
3 145 runs - 1 162 2 924 ✅ - 1 060 221 💤 - 102 0 ❌ ±0

Results for commit e5bb66c. ± Comparison against base commit 68df8dc.

This pull request removes 1162 tests.

tests.aws.scenario.bookstore.test_bookstore.TestBookstoreApplication ‑ test_lambda_dynamodb
tests.aws.scenario.bookstore.test_bookstore.TestBookstoreApplication ‑ test_opensearch_crud
tests.aws.scenario.bookstore.test_bookstore.TestBookstoreApplication ‑ test_search_books
tests.aws.scenario.bookstore.test_bookstore.TestBookstoreApplication ‑ test_setup
tests.aws.scenario.kinesis_firehose.test_kinesis_firehose.TestKinesisFirehoseScenario ‑ test_kinesis_firehose_s3
tests.aws.scenario.lambda_destination.test_lambda_destination_scenario.TestLambdaDestinationScenario ‑ test_destination_sns
tests.aws.scenario.lambda_destination.test_lambda_destination_scenario.TestLambdaDestinationScenario ‑ test_infra
tests.aws.scenario.loan_broker.test_loan_broker.TestLoanBrokerScenario ‑ test_prefill_dynamodb_table
tests.aws.scenario.loan_broker.test_loan_broker.TestLoanBrokerScenario ‑ test_stepfunctions_input_recipient_list[step_function_input0-SUCCEEDED]
tests.aws.scenario.loan_broker.test_loan_broker.TestLoanBrokerScenario ‑ test_stepfunctions_input_recipient_list[step_function_input1-SUCCEEDED]
…

♻️ This comment has been updated with latest results.

vittoriopolverino · 2025-03-19T17:16:18Z

a) hotreload>operation: Should we use the operations label (e.g., create) if we envison that at some point we might want to track other operations (e.g., reload). Unfortunately, that's a bit tricky in this case because we would need to do that in our custom Lambda RIE Golang codebase.

I’d suggest yes if create and reload, or other operations in general will share common characteristics and, in the future you want to track them in a more fine-grained way with additional labels/dimensions.

b) function>operation: Should we add an operations label (e.g., invoke) if we envison to track create as well? If so, how to treat the invocation type label, which only applies for the invoke operation?

That’s a tough one, and it ties back to question (a) 🤣

If invocation_type is the only label that isn’t shared across the different operations you want to track, we could compromise by leaving invocation_type empty for create events.

However, if you anticipate that different operations will diverge and require operation-specific labels, then I’d suggest creating two separate metrics instead of using an operation label, even if the two metrics share some common characteristics.

c) function>status: What do you think about using typed definitions for such a status label? Are we too fine-grained here?

What’s the cardinality here, this? If this is the cardinality, it seems reasonable to track it, as long as it’s valuable for you. Additionally, if everything != 'success' can be considered an error, filtering for error statuses should be straightforward. However, grouping them might be less convenient, in that case, we would need to introduce a general error attribute to group all error events != 'success'.

Additional Note:
If it helps with your decision, I totally understand that it’s often difficult to anticipate how the measurement of a feature/service might evolve over time. That’s why I’m considering introducing a schema versioning mechanism to help mitigate this issue—allowing us to apply corrections to label usage (e.g., deprecations, additions, modifications, etc.).

That said, I wouldn’t want this to be over-relied upon or used as a fallback for every case. Ideally, we should still aim to anticipate future needs as much as possible.

If you’d like, we can have a quick chat and run some queries together to test different combinations or possibilities. That might give me a better understanding of your intentions and help refine the approach 💪🏼

tiurin · 2025-03-19T20:46:00Z

However, grouping them might be less convenient, in that case, we would need to introduce a general error attribute to group all error events != 'success'.

@vittoriopolverino to confirm, do you suggest here to have 2 separate labels - one for general operation result, i.e. success/error, and another one for specific reason for error?

tiurin · 2025-03-19T21:29:37Z

@joe4dev @vittoriopolverino I added a draft PR for EventBridge Pipes: https://github.com/localstack/localstack-ext/pull/4224
Since Pipes and ESM share a part of infrastructure, namely pollers, we need to coordinate both changes and maybe unify enums like status, wdyt?

vittoriopolverino · 2025-03-20T08:50:35Z

to confirm, do you suggest here to have 2 separate labels - one for general operation result, i.e. success/error, and another one for specific reason for error?

Yes 👍🏼

Advantage: Having both successful and failed invocations in a single metric, since they share more common characteristics than differences.
Disadvantage: For successful invocations, the error reason label will be empty, but that’s an acceptable trade-off in this case.

vittoriopolverino · 2025-03-20T09:15:43Z

I added a draft PR for EventBridge Pipes: https://github.com/localstack/localstack-ext/pull/4224
Since Pipes and ESM share a part of infrastructure, namely pollers, we need to coordinate both changes and maybe unify enums like status, wdyt?

From a data modeling perspective, if there are enums having the same meaning in both EventBridge Pipes and ESM, it would make sense to ensure consistency across them.

That way, you can use the same filter to reference a status across both Pipes and ESM, instead of needing two separate filters to represent the same status. (Not sure if this kind of cross-service analysis is something you’d ever do, but in any case, if we can ensure consistency, it’s better)

joe4dev · 2025-03-20T09:16:16Z

to confirm, do you suggest here to have 2 separate labels - one for general operation result, i.e. success/error, and another one for specific reason for error?

Yes 👍🏼
* Advantage: Having both successful and failed invocations in a single metric, since they share more common characteristics than differences.

* Disadvantage: For successful invocations, the error reason label will be empty, but that’s an acceptable trade-off in this case.

One of the main question for me is what do we categorize as "error":

AWS API error responses can be expected errors and part of the intended application behavior (e.g., partial batch failure handling in Lambda ESM because an application entry fails and then gets retried). These errors are expected AWS behavior and from a LocalStack perspective not actionable errors because we correctly implement AWS error parity.
Unhandled errors are actionable for LocalStack because they indicate bugs in LocalStack and can guide maintenance effort to certain code paths (e.g., unhandled_state_error).
Some errors can be user-side errors (e.g., pending_state_error: invoke a Lambda function that is not yet active)
For some errors, we have insufficient information to categorize them and they could be user-side or LocalStack bugs (e.g., invocation_error or failed_state_error)

@vittoriopolverino @tiurin What is your understanding/suggestion regarding what should be considered an "error"?

vittoriopolverino · 2025-03-20T10:44:03Z

What is your understanding/suggestion regarding what should be considered an "error"?

🤔 If an error is expected (part of intended application behavior), doesn’t require intervention, and isn’t something we need to act on, then maybe we need a way to differentiate errors.

We have two options:
1️⃣ Use a single status label with three or more categories: success, actionable_error, and non_actionable_error, etc. The downside is that this makes it harder to filter all errors under a single error status.
2️⃣ Introduce an additional label, like is_error_actionable (true/false), while keeping status=error. This keeps all errors grouped but adds a second error-specific label alongside error_details.

Which approach makes the most sense really depends on the query patterns you/we anticipate using the most in the future. I don’t think there’s a strictly right or wrong choice between the two, it’s more about what will make querying and analysis more efficient for your use cases

tiurin · 2025-03-20T11:02:41Z

From a data modeling perspective, if there are enums having the same meaning in both EventBridge Pipes and ESM, it would make sense to ensure consistency across them.

Pipes is a more broad case of ESM, moreover they have some shared code, so if we can align both well we will avoid "if pipes" conditions for counters.

joe4dev · 2025-03-21T14:49:41Z

Background on trade-off discussions:

When to introduce operation-specific counters given the tradeoff n/a fields vs duplicated fields? For example, Lambda function creation and invocation has some shared (e.g., runtime) and some operation-specific (e.g., invocation_type) fields.

Most dimensions seem relevant for both create and invoke operations, which allows us to compare usage of the control vs. data plane. We can live with the invocation_type field using n/a upon the create operation.

How to decide whether an additional top-level is_error field (i.e., status field) is helpful. Especially when things might not add up to 100% in ESM/Pipes.

In ESM, Pipes, and Lambda, it's not clearly possible to distinguish actionable and user-side errors (e.g., invocation_error can have multiple root causes). Therefore, we couldn't clearly identify high-level error categories and a boolean flag is_error could be misleading because in ESM and Pipes, poller errors happen in a different part of the system and don't naturally complement the success case (which happens only if records are available and a pipe execution or ESM execution is triggered).

tiurin

ESM part ✅ 🚀

vittoriopolverino

LGTM 🥳 Fantastic work, really appreciate the deep thinking that went into using labels effectively. Some great discussions, questions, and insights came up while applying the new counter and structuring the labels 🚀

localstack-core/localstack/services/lambda_/provider.py

dfangl

Don't we need a similar or "n/a" for the invokes in the function counters as well, not just the creation?

joe4dev · 2025-03-24T15:40:54Z

Don't we need a similar or "n/a" for the invokes in the function counters as well, not just the creation?

absolutely spot on @dfangl 💯 Thank you!

I added the or "n/a" for invokes in the lambda_service (sync) and event_manager (async) 👍

gregfurman

LGTM. I think we should consider adding a counter for cold vs. warm starts since it appears to have an influence on performance of ESM. Not necessary in this PR though 🙂

joe4dev added the semver: patch Non-breaking changes which can be included in patch releases label Mar 19, 2025

joe4dev added this to the 4.3 milestone Mar 19, 2025

joe4dev self-assigned this Mar 19, 2025

joe4dev requested a review from vittoriopolverino March 19, 2025 13:29

joe4dev marked this pull request as ready for review March 19, 2025 16:05

joe4dev requested review from dominikschubert, dfangl and gregfurman as code owners March 19, 2025 16:05

joe4dev marked this pull request as draft March 20, 2025 14:21

joe4dev added 2 commits March 21, 2025 15:39

Migrate legacy usage counters to new analytic counters

3528b51

nit: fix comment placement

d513963

joe4dev force-pushed the migrate-lambda-analytic-counters branch from a5a534b to 2bb900b Compare March 21, 2025 14:40

Align ESM counters with Pipes and add CreateFunction tracking

cfb8d1e

joe4dev force-pushed the migrate-lambda-analytic-counters branch from 2bb900b to cfb8d1e Compare March 21, 2025 14:45

joe4dev marked this pull request as ready for review March 21, 2025 15:00

tiurin approved these changes Mar 21, 2025

View reviewed changes

vittoriopolverino approved these changes Mar 21, 2025

View reviewed changes

joe4dev commented Mar 24, 2025

View reviewed changes

localstack-core/localstack/services/lambda_/provider.py Outdated Show resolved Hide resolved

Handle runtime None values

09e96b5

dfangl reviewed Mar 24, 2025

View reviewed changes

joe4dev added 2 commits March 24, 2025 16:34

Handle runtime None value for function invokes

a39fb91

Handle runtime None values for async invokes

e5bb66c

gregfurman approved these changes Mar 25, 2025

View reviewed changes

joe4dev merged commit 08382d0 into master Mar 25, 2025
31 checks passed

joe4dev deleted the migrate-lambda-analytic-counters branch March 25, 2025 14:41

Uh oh!

Migrate legacy usage counters to new analytic counters #12412

Migrate legacy usage counters to new analytic counters #12412

Uh oh!

Conversation

joe4dev commented Mar 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Changes

Testing

Questions

Uh oh!

github-actions bot commented Mar 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

LocalStack Community integration with Pro

Uh oh!

vittoriopolverino commented Mar 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tiurin commented Mar 19, 2025

Uh oh!

tiurin commented Mar 19, 2025

Uh oh!

vittoriopolverino commented Mar 20, 2025

Uh oh!

vittoriopolverino commented Mar 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

joe4dev commented Mar 20, 2025

Uh oh!

vittoriopolverino commented Mar 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tiurin commented Mar 20, 2025

Uh oh!

joe4dev commented Mar 21, 2025

Uh oh!

tiurin left a comment

Choose a reason for hiding this comment

Uh oh!

vittoriopolverino left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

dfangl left a comment

Choose a reason for hiding this comment

Uh oh!

joe4dev commented Mar 24, 2025

Uh oh!

gregfurman left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

joe4dev commented Mar 19, 2025 •

edited

Loading

github-actions bot commented Mar 19, 2025 •

edited

Loading

vittoriopolverino commented Mar 19, 2025 •

edited

Loading

vittoriopolverino commented Mar 20, 2025 •

edited

Loading

vittoriopolverino commented Mar 20, 2025 •

edited

Loading

vittoriopolverino left a comment •

edited

Loading