KAFKA-19599: Reduce the frequency of ReplicaNotAvailableException thrown to clients when RLMM is not ready #20345

Open
wants to merge 1 commit into
base: trunk
from

Conversation

kamalcph
Contributor

@kamalcph kamalcph commented Aug 12, 2025

During broker restarts, the topic-based RemoteLogMetadataManager (RLMM)
rebuilds its state by reading the internal __remote_log_metadata
topic. Until a partition is ready to perform remote storage
operations, a ReplicaNotAvailableException is thrown back to the
consumer, and the client retries the request immediately.

This results in a large number of FETCH requests on the broker and ties
up the request handler threads. This patch uses a CountDownLatch to
reduce the frequency of ReplicaNotAvailableException thrown back to
clients, which improves request handler thread usage on the broker.
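
The idea described above can be sketched roughly as follows. This is a minimal illustration, not the code in this patch: `ReadinessGate`, `markReady`, `awaitReady`, `fetch`, and the wait bound are hypothetical names, and `IllegalStateException` stands in for ReplicaNotAvailableException so the snippet compiles without Kafka on the classpath. The description does not say on which thread the wait happens, so the sketch does not model that either.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: one gate per partition whose remote log metadata
// is still being rebuilt after a broker restart.
final class ReadinessGate {
    private final CountDownLatch ready = new CountDownLatch(1);
    private final long maxWaitMs; // assumed upper bound on how long a caller waits

    ReadinessGate(long maxWaitMs) {
        this.maxWaitMs = maxWaitMs;
    }

    // Called once the metadata for the partition has been loaded.
    void markReady() {
        ready.countDown();
    }

    // Waits up to maxWaitMs for readiness instead of failing immediately.
    // Returns true if the partition became ready within the bound.
    boolean awaitReady() {
        try {
            return ready.await(maxWaitMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return false;
        }
    }
}

final class ReadinessGateDemo {
    // Stand-in for a remote fetch path; the exception type is a placeholder.
    static String fetch(ReadinessGate gate) {
        if (!gate.awaitReady()) {
            throw new IllegalStateException("partition not ready for remote storage operations");
        }
        return "records";
    }

    public static void main(String[] args) {
        ReadinessGate gate = new ReadinessGate(100);
        try {
            fetch(gate);
        } catch (IllegalStateException e) {
            System.out.println("not ready: " + e.getMessage());
        }
        gate.markReady();
        System.out.println(fetch(gate));
    }
}
```

Because each failed attempt now takes up to the wait bound before the error is returned, a client that retries immediately can issue at most about one request per bound while the partition is not ready, rather than retrying as fast as the round trip allows.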

Previously, for one consumer, when RLMM was not ready for a partition,
the broker received ~9K FetchConsumer requests / sec. With this patch,
the number of FETCH requests is reduced by 95% to 600 / sec.

@github-actions github-actions bot added the triage (PRs from the community), storage (Pull requests that target the storage module), tiered-storage (Related to the Tiered Storage feature), and small (Small PRs) labels Aug 12, 2025
@kamalcph
Contributor Author

@satishd @showuon

Call for review. PTAL. Thanks!

Labels
small (Small PRs), storage (Pull requests that target the storage module), tiered-storage (Related to the Tiered Storage feature), triage (PRs from the community)