Improve RedisCluster slot caching and invalidation. #2655

Draft
wants to merge 6 commits into
base: develop

michael-grunder
Member

This PR contains a few changes to our slot caching mechanism and cluster remapping.

  1. Elevate CLUSTERDOWN to an event that triggers a remapping of the keyspace. This can occur, for example, right after a node is removed from the cluster via CLUSTER FORGET. In addition, we now immediately remap the keyspace in this situation.
  2. Add a debug function to allow the user to flush the persistent slot cache manually. Right now this would only affect the individual worker (or CLI) process, so it may have limited utility initially.
  3. Add a new ini setting to tell RedisCluster that the slot cache should expire after some number of seconds. This ensures that the keyspace is remapped even after non-destructive changes are made to the cluster, such as a new replica being brought online. New replicas would never cause MOVED or ASK redirections, so PhpRedis wouldn't know they were available.

When a replica node is taken offline, it will briefly respond to
commands with some form of `CLUSTERDOWN` error. Previously, although this
caused PhpRedis to invalidate the slot cache, we did not immediately
remap the topology.

This meant that, for an already connected cluster, we might keep
sending readonly commands to this now decommissioned cluster replica for
the duration of the request.

Additionally, if the request was long-lived enough, we might still have
the replica mapped when it was brought back online, at which point the
client might see dozens of `LOADING` exceptions.

This commit just changes our behavior when we detect any `CLUSTERDOWN`
error such that we flush the slot cache, disconnect the particular node,
and also attempt a remap of the topology.
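
As a rough illustration of the behavior described above, the sketch below maps an error reply to the three actions the commit message lists. It is not the PhpRedis implementation: the `cluster_err` and `cluster_actions` types and the `classify_error` and `actions_for` helpers are hypothetical names invented for this example, and it assumes the error text is passed without the leading `-` of the wire protocol.

```c
#include <string.h>

/* Hypothetical error classes; these are not PhpRedis's own types. */
typedef enum {
    ERR_OTHER,
    ERR_MOVED,
    ERR_ASK,
    ERR_CLUSTERDOWN
} cluster_err;

/* Hypothetical record of what the client should do next. */
typedef struct {
    int flush_slot_cache;
    int disconnect_node;
    int remap_topology;
} cluster_actions;

/* Returns nonzero when s begins with prefix. */
int has_prefix(const char *s, const char *prefix) {
    return strncmp(s, prefix, strlen(prefix)) == 0;
}

/* Classify an error message by its leading keyword. */
cluster_err classify_error(const char *msg) {
    if (has_prefix(msg, "CLUSTERDOWN")) return ERR_CLUSTERDOWN;
    if (has_prefix(msg, "MOVED "))      return ERR_MOVED;
    if (has_prefix(msg, "ASK "))        return ERR_ASK;
    return ERR_OTHER;
}

/* Any CLUSTERDOWN error triggers all three actions at once.
 * Handling of the other classes is out of scope for this sketch. */
cluster_actions actions_for(cluster_err err) {
    cluster_actions a = {0, 0, 0};
    if (err == ERR_CLUSTERDOWN) {
        a.flush_slot_cache = 1;
        a.disconnect_node  = 1;
        a.remap_topology   = 1;
    }
    return a;
}
```

The point of the sketch is only that `CLUSTERDOWN` now implies an immediate remap in addition to the cache flush, rather than leaving the stale mapping in place for the rest of the request.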

This commit adds a new `ini` setting `redis.clusters.slot_cache_expiry`
which specifies in seconds how long a cached slot map should live before
being refreshed. The setting is disabled by default to avoid breaking
backward compatibility.
@michael-grunder michael-grunder force-pushed the fix/slot-cache-invalidation branch 3 times, most recently from b609b64 to 86391ec Compare May 1, 2025 01:53
@michael-grunder michael-grunder force-pushed the fix/slot-cache-invalidation branch from 86391ec to 2ecdd3f Compare May 1, 2025 02:34
This commit adds a mechanism to invalidate every `RedisCluster` slot
cache across all forked workers with a single call from any one worker.

The invalidation itself works by using a `uint64_t` generation counter
that is allocated in shared memory during `MINIT`. The counter's current
value is stored in every slot cache when the cache is created.

When a user wants to invalidate all of the slot caches, they can call
`RedisCluster::invalidateSlotCaches()`, which simply does an atomic
increment on this global generation value.

When a slot cache is retrieved from the persistent list, its internal
generation is compared to the global shared memory generation. If
they don't match, the cache is not used, which is the same as
invalidating it, because RedisCluster will write a new slot cache on
object destruction.
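
The generation scheme described above can be sketched in a few lines of C. This is a minimal illustration under stated assumptions, not the code in this PR: it assumes anonymous shared `mmap` and C11 atomics are available, and the names `generation_init`, `invalidate_slot_caches`, `slot_cache_stamp`, `slot_cache_is_current`, and the `slot_cache` struct are hypothetical.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* exposes MAP_ANONYMOUS on glibc in strict modes */
#endif
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Global generation counter. It lives in a shared mapping so that
 * workers forked after initialization all observe the same value. */
static _Atomic uint64_t *g_generation = NULL;

/* Hypothetical slot cache; the actual slot map is omitted. */
typedef struct {
    uint64_t generation; /* global generation at creation time */
} slot_cache;

/* Allocate the counter once, before any fork (the MINIT step above).
 * Returns 0 on success and -1 if the mapping could not be created. */
int generation_init(void) {
    void *p = mmap(NULL, sizeof(_Atomic uint64_t), PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return -1;
    g_generation = p;
    atomic_init(g_generation, 0);
    return 0;
}

/* Invalidate every cache in every worker with one atomic increment. */
void invalidate_slot_caches(void) {
    atomic_fetch_add(g_generation, 1);
}

/* Record the current generation in a newly created cache. */
void slot_cache_stamp(slot_cache *c) {
    c->generation = atomic_load(g_generation);
}

/* A cache is usable only if no invalidation happened since it was stamped. */
int slot_cache_is_current(const slot_cache *c) {
    return c->generation == atomic_load(g_generation);
}
```

A design consequence worth noting is that invalidation never touches the caches themselves. Each worker discovers the mismatch lazily the next time it fetches its cache, so the invalidating call is a single atomic operation regardless of how many workers exist.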

The feature is only compiled if `config.m4` determines that using `mmap`
to allocate shared memory is available and works.

Additionally, the feature must be explicitly enabled via the
`redis.clusters.shared_slot_cache_invalidation` ini setting.
@michael-grunder michael-grunder force-pushed the fix/slot-cache-invalidation branch from 2ecdd3f to fead6c7 Compare May 1, 2025 02:48
@michael-grunder michael-grunder force-pushed the fix/slot-cache-invalidation branch from dc6e530 to 2e38b5c Compare May 1, 2025 18:44
`time(NULL)` isn't monotonic and can run backwards, so switch to
`zend_hrtime`, which uses `clock_gettime(CLOCK_MONOTONIC)` on Linux and
`QueryPerformanceCounter` on Windows.
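
To show why a monotonic clock matters for cache expiry, here is a small POSIX-only sketch that calls `clock_gettime(CLOCK_MONOTONIC)` directly instead of `zend_hrtime`. The helper names `monotonic_ns` and `slot_cache_expired` are hypothetical, and treating an expiry of `0` as "disabled" is an assumption made for this example rather than something the commit message specifies.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* exposes clock_gettime in strict modes */
#endif
#include <stdint.h>
#include <time.h>

/* Current monotonic time in nanoseconds. Unlike time(NULL), this value
 * does not jump backwards when the wall clock is adjusted. */
uint64_t monotonic_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Returns 1 when a cache created at created_ns is at least expiry_sec
 * seconds old at now_ns. Assumption: expiry_sec == 0 disables expiry. */
int slot_cache_expired(uint64_t created_ns, uint64_t now_ns,
                       uint64_t expiry_sec) {
    if (expiry_sec == 0) return 0;
    return now_ns - created_ns >= expiry_sec * 1000000000ull;
}
```

Because both timestamps come from the same monotonic source, `now_ns - created_ns` cannot underflow within a process, which is exactly the failure mode a wall clock running backwards would introduce.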
@michael-grunder michael-grunder force-pushed the fix/slot-cache-invalidation branch from 2e38b5c to f01f805 Compare May 1, 2025 19:09