Skip to content

Migrate legacy usage counters to new analytic counters #12412

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Mar 25, 2025

Conversation

joe4dev
Copy link
Member

@joe4dev joe4dev commented Mar 19, 2025

Motivation

Migrate Lambda and Lambda Event Source Mapping counters to our new standard.

Changes

  • Remove legacy counters
  • Add hotreload counter for tracking how many Lambda functions are created with hot reloading enabled
  • Add function counter for tracking Lambda functions by:
    • operation: invoke
    • runtime: AWS Lambda runtime such as python3.12 or n/a if package_type is Image
    • invocation_type: Event | RequestResponse
    • status: trying out a typed definition for distinguishing different error types
    • package_type: Zip | Image
  • Add esm counter for tracking Lambda Event Source Mappings by:
    • status: success | partial_batch_failure_error | batch_failure_error | unhandled_error
    • source: e.g., aws:sqs

Testing

  • hotreload: aws.services.lambda_.test_lambda_developer_tools.TestHotReloading.test_hot_reloading
  • function:
    • invocation_type:
      • aws.services.lambda_.test_lambda.TestLambdaFeatures.test_invocation_type_request_response
      • aws.services.lambda_.test_lambda.TestLambdaFeatures.test_invocation_type_event
    • runtime: aws.services.lambda_.test_lambda_common.TestLambdaRuntimesCommon.test_echo_invoke
    • status: many scenarios in test_lambda.py
  • esm: tests/aws/services/lambda_/event_source_mapping

Questions

a) hotreload>operation: Should we use the operations label (e.g., create) if we envison that at some point we might want to track other operations (e.g., reload). Unfortunately, that's a bit tricky in this case because we would need to do that in our custom Lambda RIE Golang codebase.
b) function>operation: Should we add an operations label (e.g., invoke) if we envison to track create as well? If so, how to treat the invocation type label, which only applies for the invoke operation?
c) function>status: What do you think about using typed definitions for such a status label? Are we too fine-grained here?

@joe4dev joe4dev added the semver: patch Non-breaking changes which can be included in patch releases label Mar 19, 2025
@joe4dev joe4dev added this to the 4.3 milestone Mar 19, 2025
@joe4dev joe4dev self-assigned this Mar 19, 2025
Copy link

github-actions bot commented Mar 19, 2025

LocalStack Community integration with Pro

    2 files  ±    0      2 suites  ±0   1h 30m 8s ⏱️ - 25m 23s
3 143 tests  - 1 162  2 924 ✅  - 1 060  219 💤  - 102  0 ❌ ±0 
3 145 runs   - 1 162  2 924 ✅  - 1 060  221 💤  - 102  0 ❌ ±0 

Results for commit e5bb66c. ± Comparison against base commit 68df8dc.

This pull request removes 1162 tests.
tests.aws.scenario.bookstore.test_bookstore.TestBookstoreApplication ‑ test_lambda_dynamodb
tests.aws.scenario.bookstore.test_bookstore.TestBookstoreApplication ‑ test_opensearch_crud
tests.aws.scenario.bookstore.test_bookstore.TestBookstoreApplication ‑ test_search_books
tests.aws.scenario.bookstore.test_bookstore.TestBookstoreApplication ‑ test_setup
tests.aws.scenario.kinesis_firehose.test_kinesis_firehose.TestKinesisFirehoseScenario ‑ test_kinesis_firehose_s3
tests.aws.scenario.lambda_destination.test_lambda_destination_scenario.TestLambdaDestinationScenario ‑ test_destination_sns
tests.aws.scenario.lambda_destination.test_lambda_destination_scenario.TestLambdaDestinationScenario ‑ test_infra
tests.aws.scenario.loan_broker.test_loan_broker.TestLoanBrokerScenario ‑ test_prefill_dynamodb_table
tests.aws.scenario.loan_broker.test_loan_broker.TestLoanBrokerScenario ‑ test_stepfunctions_input_recipient_list[step_function_input0-SUCCEEDED]
tests.aws.scenario.loan_broker.test_loan_broker.TestLoanBrokerScenario ‑ test_stepfunctions_input_recipient_list[step_function_input1-SUCCEEDED]
…

♻️ This comment has been updated with latest results.

@joe4dev joe4dev marked this pull request as ready for review March 19, 2025 16:05
@vittoriopolverino
Copy link
Member

vittoriopolverino commented Mar 19, 2025

a) hotreload>operation: Should we use the operations label (e.g., create) if we envison that at some point we might want to track other operations (e.g., reload). Unfortunately, that's a bit tricky in this case because we would need to do that in our custom Lambda RIE Golang codebase.

I’d suggest yes if create and reload, or other operations in general will share common characteristics and, in the future you want to track them in a more fine-grained way with additional labels/dimensions.

b) function>operation: Should we add an operations label (e.g., invoke) if we envison to track create as well? If so, how to treat the invocation type label, which only applies for the invoke operation?

That’s a tough one, and it ties back to question (a) 🤣

If invocation_type is the only label that isn’t shared across the different operations you want to track, we could compromise by leaving invocation_type empty for create events.

However, if you anticipate that different operations will diverge and require operation-specific labels, then I’d suggest creating two separate metrics instead of using an operation label, even if the two metrics share some common characteristics.

c) function>status: What do you think about using typed definitions for such a status label? Are we too fine-grained here?

What’s the cardinality here, this? If this is the cardinality, it seems reasonable to track it, as long as it’s valuable for you. Additionally, if everything != 'success' can be considered an error, filtering for error statuses should be straightforward. However, grouping them might be less convenient, in that case, we would need to introduce a general error attribute to group all error events != 'success'.


Additional Note:
If it helps with your decision, I totally understand that it’s often difficult to anticipate how the measurement of a feature/service might evolve over time. That’s why I’m considering introducing a schema versioning mechanism to help mitigate this issue—allowing us to apply corrections to label usage (e.g., deprecations, additions, modifications, etc.).

That said, I wouldn’t want this to be over-relied upon or used as a fallback for every case. Ideally, we should still aim to anticipate future needs as much as possible.

If you’d like, we can have a quick chat and run some queries together to test different combinations or possibilities. That might give me a better understanding of your intentions and help refine the approach 💪🏼

@tiurin
Copy link
Contributor

tiurin commented Mar 19, 2025

However, grouping them might be less convenient, in that case, we would need to introduce a general error attribute to group all error events != 'success'.

@vittoriopolverino to confirm, do you suggest here to have 2 separate labels - one for general operation result, i.e. success/error, and another one for specific reason for error?

@tiurin
Copy link
Contributor

tiurin commented Mar 19, 2025

@joe4dev @vittoriopolverino I added a draft PR for EventBridge Pipes: https://github.com/localstack/localstack-ext/pull/4224
Since Pipes and ESM share a part of infrastructure, namely pollers, we need to coordinate both changes and maybe unify enums like status, wdyt?

@vittoriopolverino
Copy link
Member

to confirm, do you suggest here to have 2 separate labels - one for general operation result, i.e. success/error, and another one for specific reason for error?

Yes 👍🏼

  • Advantage: Having both successful and failed invocations in a single metric, since they share more common characteristics than differences.
  • Disadvantage: For successful invocations, the error reason label will be empty, but that’s an acceptable trade-off in this case.

@vittoriopolverino
Copy link
Member

vittoriopolverino commented Mar 20, 2025

I added a draft PR for EventBridge Pipes: https://github.com/localstack/localstack-ext/pull/4224
Since Pipes and ESM share a part of infrastructure, namely pollers, we need to coordinate both changes and maybe unify enums like status, wdyt?

From a data modeling perspective, if there are enums having the same meaning in both EventBridge Pipes and ESM, it would make sense to ensure consistency across them.

That way, you can use the same filter to reference a status across both Pipes and ESM, instead of needing two separate filters to represent the same status. (Not sure if this kind of cross-service analysis is something you’d ever do, but in any case, if we can ensure consistency, it’s better)

@joe4dev
Copy link
Member Author

joe4dev commented Mar 20, 2025

to confirm, do you suggest here to have 2 separate labels - one for general operation result, i.e. success/error, and another one for specific reason for error?

Yes 👍🏼

* Advantage: Having both successful and failed invocations in a single metric, since they share more common characteristics than differences.

* Disadvantage: For successful invocations, the error reason label will be empty, but that’s an acceptable trade-off in this case.

One of the main question for me is what do we categorize as "error":

  • AWS API error responses can be expected errors and part of the intended application behavior (e.g., partial batch failure handling in Lambda ESM because an application entry fails and then gets retried). These errors are expected AWS behavior and from a LocalStack perspective not actionable errors because we correctly implement AWS error parity.
  • Unhandled errors are actionable for LocalStack because they indicate bugs in LocalStack and can guide maintenance effort to certain code paths (e.g., unhandled_state_error).
  • Some errors can be user-side errors (e.g., pending_state_error: invoke a Lambda function that is not yet active)
  • For some errors, we have insufficient information to categorize them and they could be user-side or LocalStack bugs (e.g., invocation_error or failed_state_error)

@vittoriopolverino @tiurin What is your understanding/suggestion regarding what should be considered an "error"?

@vittoriopolverino
Copy link
Member

vittoriopolverino commented Mar 20, 2025

What is your understanding/suggestion regarding what should be considered an "error"?

🤔 If an error is expected (part of intended application behavior), doesn’t require intervention, and isn’t something we need to act on, then maybe we need a way to differentiate errors.

We have two options:
1️⃣ Use a single status label with three or more categories: success, actionable_error, and non_actionable_error, etc. The downside is that this makes it harder to filter all errors under a single error status.
2️⃣ Introduce an additional label, like is_error_actionable (true/false), while keeping status=error. This keeps all errors grouped but adds a second error-specific label alongside error_details.

Which approach makes the most sense really depends on the query patterns you/we anticipate using the most in the future. I don’t think there’s a strictly right or wrong choice between the two, it’s more about what will make querying and analysis more efficient for your use cases

@tiurin
Copy link
Contributor

tiurin commented Mar 20, 2025

From a data modeling perspective, if there are enums having the same meaning in both EventBridge Pipes and ESM, it would make sense to ensure consistency across them.

Pipes is a more broad case of ESM, moreover they have some shared code, so if we can align both well we will avoid "if pipes" conditions for counters.

@joe4dev joe4dev marked this pull request as draft March 20, 2025 14:21
@joe4dev joe4dev force-pushed the migrate-lambda-analytic-counters branch from a5a534b to 2bb900b Compare March 21, 2025 14:40
@joe4dev joe4dev force-pushed the migrate-lambda-analytic-counters branch from 2bb900b to cfb8d1e Compare March 21, 2025 14:45
@joe4dev
Copy link
Member Author

joe4dev commented Mar 21, 2025

Background on trade-off discussions:

  1. When to introduce operation-specific counters given the tradeoff n/a fields vs duplicated fields? For example, Lambda function creation and invocation has some shared (e.g., runtime) and some operation-specific (e.g., invocation_type) fields.

Most dimensions seem relevant for both create and invoke operations, which allows us to compare usage of the control vs. data plane. We can live with the invocation_type field using n/a upon the create operation.

  1. How to decide whether an additional top-level is_error field (i.e., status field) is helpful. Especially when things might not add up to 100% in ESM/Pipes.

In ESM, Pipes, and Lambda, it's not clearly possible to distinguish actionable and user-side errors (e.g., invocation_error can have multiple root causes). Therefore, we couldn't clearly identify high-level error categories and a boolean flag is_error could be misleading because in ESM and Pipes, poller errors happen in a different part of the system and don't naturally complement the success case (which happens only if records are available and a pipe execution or ESM execution is triggered).

@joe4dev joe4dev marked this pull request as ready for review March 21, 2025 15:00
Copy link
Contributor

@tiurin tiurin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ESM part ✅ 🚀

Copy link
Member

@vittoriopolverino vittoriopolverino left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🥳 Fantastic work, really appreciate the deep thinking that went into using labels effectively. Some great discussions, questions, and insights came up while applying the new counter and structuring the labels 🚀

Copy link
Member

@dfangl dfangl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we need a similar or "n/a" for the invokes in the function counters as well, not just the creation?

@joe4dev
Copy link
Member Author

joe4dev commented Mar 24, 2025

Don't we need a similar or "n/a" for the invokes in the function counters as well, not just the creation?

absolutely spot on @dfangl 💯 Thank you!

I added the or "n/a" for invokes in the lambda_service (sync) and event_manager (async) 👍

Copy link
Contributor

@gregfurman gregfurman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I think we should consider adding a counter for cold vs. warm starts since it appears to have an influence on performance of ESM. Not necessary in this PR though 🙂

@joe4dev joe4dev merged commit 08382d0 into master Mar 25, 2025
31 checks passed
@joe4dev joe4dev deleted the migrate-lambda-analytic-counters branch March 25, 2025 14:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
semver: patch Non-breaking changes which can be included in patch releases
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants