Collection of test fixes (2025Q2, batch 1) (backport #14206) #14372

Merged
45 commits merged into v4.1.x on Aug 12, 2025

Conversation

mergify[bot]

@mergify mergify bot commented Aug 12, 2025

This pull request addresses the test flakes that appeared over the past couple of months.


This is an automatic backport of pull request #14206 done by Mergify.

Author

mergify bot commented Aug 12, 2025

Cherry-pick of bbcd04d has failed:

On branch mergify/bp/v4.1.x/pr-14206
Your branch is up to date with 'origin/v4.1.x'.

You are currently cherry-picking commit bbcd04d93.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Unmerged paths:
  (use "git add <file>..." to mark resolution)
	both modified:   deps/rabbit/test/feature_flags_SUITE.erl

no changes added to commit (use "git add" and/or "git commit -a")

Cherry-pick of f973932 has failed:

On branch mergify/bp/v4.1.x/pr-14206
Your branch is ahead of 'origin/v4.1.x' by 1 commit.
  (use "git push" to publish your local commits)

You are currently cherry-picking commit f973932a2.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:
	modified:   deps/rabbit/test/queue_utils.erl

Unmerged paths:
  (use "git add <file>..." to mark resolution)
	both modified:   deps/rabbit/test/quorum_queue_SUITE.erl

Cherry-pick of ab76698 has failed:

On branch mergify/bp/v4.1.x/pr-14206
Your branch is ahead of 'origin/v4.1.x' by 3 commits.
  (use "git push" to publish your local commits)

You are currently cherry-picking commit ab766981a.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Unmerged paths:
  (use "git add <file>..." to mark resolution)
	both modified:   deps/rabbit/test/cluster_minority_SUITE.erl

no changes added to commit (use "git add" and/or "git commit -a")

Cherry-pick of ce1545d has failed:

On branch mergify/bp/v4.1.x/pr-14206
Your branch is ahead of 'origin/v4.1.x' by 11 commits.
  (use "git push" to publish your local commits)

You are currently cherry-picking commit ce1545d51.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Unmerged paths:
  (use "git add <file>..." to mark resolution)
	both modified:   deps/rabbitmq_mqtt/test/java_SUITE.erl

no changes added to commit (use "git add" and/or "git commit -a")

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally

…ader`

[Why]
When node A becomes the leader, it still takes a few exchanges with
followers to allow cluster changes again.

Concurrently, the testcase blocks traffic between A and other nodes to
simulate a network partition. If this happens after A becomes the leader
but before cluster changes are permitted again, the testcase will never
succeed and will eventually abort with:

    Case: cluster_minority_SUITE:remove_node_when_seed_node_is_leader
    Reason: {error,
                {{awaitMatch,
                     [{module,cluster_minority_SUITE},
                      {expression, "..."},
                      {pattern,"ok"},
                      {value,
                          {error,69,<<"Error:\ncluster_change_not_permitted">>}}]},

[How]
Before blocking traffic, we wait for cluster changes to be permitted
again on node A.

(cherry picked from commit 7166357)
... which is a previous copy of the variable.

Also, while here, fix style: the line was breaking the 80-col limit.

(cherry picked from commit 51aba85)
@dumbbell dumbbell force-pushed the mergify/bp/v4.1.x/pr-14206 branch from 3110116 to f14332f Compare August 12, 2025 15:22
@dumbbell dumbbell force-pushed the mergify/bp/v4.1.x/pr-14206 branch from f14332f to 06b70e0 Compare August 12, 2025 16:33
@mergify mergify bot added the make label Aug 12, 2025
dumbbell and others added 20 commits August 12, 2025 19:08
[Why]
The default checkpoint interval is 16384. Therefore with 20,000
messages published by the testcase, there is a chance a checkpoint is
created. This would hit an assertion in the testcase which expects no
checkpoints before it forces the creation of one. We see this happening
in CI, but not locally, because the testcase runs fast enough there.

[How]
The testcase now sends 10,000 messages. This is still a lot of messages
while staying under the default checkpoint interval.

(cherry picked from commit 1582ae6)
... in `remove_node_when_seed_node_is_leader/1` and
`remove_node_when_seed_node_is_follower/1`.

[Why]
Until now, the check was performed after the partition. That was incorrect
because if a cluster change was not permitted at the time of the partition,
it would not become permitted afterwards. There was therefore a race
condition here.

[How]
Now, the check is performed before the partition.

Thanks to this new approach, we are sure of the state of node A and
don't need the case block near the end of the test cases.

This should fix some test flakes we see locally and in CI.

(cherry picked from commit ab76698)
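
As a rough illustration of this ordering, here is a minimal sketch, assuming
the `?awaitMatch/3` macro from `rabbitmq_ct_helpers` and
`rabbit_ct_broker_helpers:block_traffic_between/2`; `cluster_change_permitted/2`
is a hypothetical helper standing in for the actual RPC check:

```
%% Sketch only: cluster_change_permitted/2 is a hypothetical helper that
%% checks, over RPC, whether node A accepts cluster changes again.
wait_then_partition(Config, NodeA, OtherNodes) ->
    %% Wait (up to 30 seconds) for node A to permit cluster changes again...
    ?awaitMatch(true, cluster_change_permitted(Config, NodeA), 30000),
    %% ...and only then block traffic to simulate the network partition.
    [rabbit_ct_broker_helpers:block_traffic_between(NodeA, Node)
     || Node <- OtherNodes],
    ok.
```
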
…de_list_in_user`

[Why]
This was the only place where a condition was checked once after a
connection close, instead of waiting for it to become true.

This caused some transient failures in CI when the connection tracking
took a bit of time to update and the check was performed before that.

(cherry picked from commit f613743)
[Why]
The tests relied on the `rabbit_ct_client_helpers` connection and channel
manager, which doesn't seem to be robust. So far it has caused more harm
than good.

Hopefully, this will fix some test flakes in CI.

(cherry picked from commit 22c0959)
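
For context, a minimal sketch of what a dedicated, per-testcase connection can
look like with the Erlang AMQP client, assuming the standard `amqp_client` API
and `rabbit_ct_broker_helpers:get_node_config/3` to look up the node's AMQP
port (the `with_dedicated_connection/2` wrapper is an illustration, not the
commit's actual code):

```
-include_lib("amqp_client/include/amqp_client.hrl").

with_dedicated_connection(Config, Fun) ->
    %% Look up the AMQP port of node 0 instead of relying on the
    %% rabbit_ct_client_helpers connection/channel manager.
    Port = rabbit_ct_broker_helpers:get_node_config(Config, 0, tcp_port_amqp),
    {ok, Conn} = amqp_connection:start(#amqp_params_network{port = Port}),
    {ok, Ch} = amqp_connection:open_channel(Conn),
    try
        Fun(Ch)
    after
        %% Close the channel and connection explicitly so that this test
        %% case cannot leak state into the next one.
        amqp_channel:close(Ch),
        amqp_connection:close(Conn)
    end.
```
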
[Why]
The tests relied on the `rabbit_ct_client_helpers` connection and channel
manager, which doesn't seem to be robust. So far it has caused more harm
than good.

Hopefully, this will fix some test flakes in CI.

(cherry picked from commit 0bc2d8b)
[Why]
It looks like `erlang_vm_dist_node_queue_size_bytes` is not always
present, even though other Erlang-specific metrics are present.

[How]
The goal is to ensure Erlang metrics are present in the output, so just
use another one that is likely to be there.

(cherry picked from commit c672935)
[Why]
The test flaked today since the restart took 309ms, which is above the
allowed 100ms (outside of CI, it takes single-digit milliseconds).

[How]
Increase the allowed time but also significantly increase next_seq_id.
This test exists because in the past we had an O(n) algorithm in CQ
recovery, leading to a slow recovery of even empty queues, if they had
a very large next_seq_id. Now that this operation is O(1), a much larger
next_seq_id shouldn't affect the time it takes to run
this test, while accidentally re-introducing an O(n) algorithm should
fail this test consistently.

(cherry picked from commit 37b7a2a)
[Why]
In the `node_channel_limit` testcase, we open several channels and
verify the count of opened channels in all places but one: after the
first connection failure, when we try to open 3 channels.

Opening 3 channels in a row might not be tracked in time to reject the
third channel because the counter is updated asynchronously.

[How]
We simply wait for the counter to reach 5 before opening the third
channel.

We change all checks to use `?awaitMatch/3` in the process to be more
robust with timing issues.

(cherry picked from commit 6111c27)
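
As an illustration, a minimal sketch of such a wait, assuming the
`?awaitMatch/3` macro from `rabbitmq_ct_helpers`; `tracked_channel_count/1` is
a hypothetical helper standing in for the RPC call that reads the asynchronous
counter:

```
%% Sketch only: tracked_channel_count/1 is a hypothetical helper that reads
%% the tracked-channel counter on the broker over RPC.
?awaitMatch(5, tracked_channel_count(Config), 30000),
%% Only once the counter has caught up is the next channel opened; the
%% per-node limit is then expected to reject it deterministically.
```
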
... when testing vhost limits

[Why]
The tracking is asynchronous, thus the third MQTT connection might be
opened before the tracking is up-to-date, which the testcase doesn't
expect.

(cherry picked from commit 5aab965)
Prior to this commit, the following test case flaked:
```
make -C deps/rabbitmq_mqtt ct-v5 t=cluster_size_1:session_upgrade_v3_v5_qos1
```

The test case failed with:
```
{v5_SUITE,session_upgrade_v3_v5_qos,1112}
{test_case_failed,Received unexpected PUBLISH payload. Expected: <<"2">> Got: <<"1">>}
```

The broker logs showed:
```
2025-07-15 15:50:23.914152+00:00 [debug] <0.758.0> MQTT accepting TCP connection <0.758.0> (127.0.0.1:38594 -> 127.0.0.1:27005)
2025-07-15 15:50:23.914289+00:00 [debug] <0.758.0> Received a CONNECT, client ID: session_upgrade_v3_v5_qos, username: undefined, clean start: false, protocol version: 3, keepalive: 60, property names: []
2025-07-15 15:50:23.914403+00:00 [debug] <0.758.0> MQTT connection 127.0.0.1:38594 -> 127.0.0.1:27005 picked vhost using plugin_configuration_or_default_vhost
2025-07-15 15:50:23.914480+00:00 [debug] <0.758.0> User 'guest' authenticated successfully by backend rabbit_auth_backend_internal
2025-07-15 15:50:23.914641+00:00 [info] <0.758.0> Accepted MQTT connection 127.0.0.1:38594 -> 127.0.0.1:27005 for client ID session_upgrade_v3_v5_qos
2025-07-15 15:50:23.914977+00:00 [debug] <0.758.0> Received a SUBSCRIBE with subscription(s) [{mqtt_subscription,
2025-07-15 15:50:23.914977+00:00 [debug] <0.758.0>                                             <<"session_upgrade_v3_v5_qos">>,
2025-07-15 15:50:23.914977+00:00 [debug] <0.758.0>                                             {mqtt_subscription_opts,1,false,
2025-07-15 15:50:23.914977+00:00 [debug] <0.758.0>                                              false,0,undefined}}]
2025-07-15 15:50:23.924503+00:00 [debug] <0.764.0> MQTT accepting TCP connection <0.764.0> (127.0.0.1:38608 -> 127.0.0.1:27005)
2025-07-15 15:50:23.924922+00:00 [debug] <0.764.0> Received a CONNECT, client ID: session_upgrade_v3_v5_qos, username: undefined, clean start: false, protocol version: 5, keepalive: 60, property names: []
2025-07-15 15:50:23.925589+00:00 [error] <0.758.0> writing to MQTT socket #Port<0.63> failed: closed
2025-07-15 15:50:23.925635+00:00 [debug] <0.764.0> MQTT connection 127.0.0.1:38608 -> 127.0.0.1:27005 picked vhost using plugin_configuration_or_default_vhost
2025-07-15 15:50:23.925670+00:00 [info] <0.758.0> MQTT connection <<"127.0.0.1:38594 -> 127.0.0.1:27005">> will terminate because peer closed TCP connection
2025-07-15 15:50:23.925727+00:00 [debug] <0.764.0> User 'guest' authenticated successfully by backend rabbit_auth_backend_internal
2025-07-15 15:50:24.000790+00:00 [info] <0.764.0> Accepted MQTT connection 127.0.0.1:38608 -> 127.0.0.1:27005 for client ID session_upgrade_v3_v5_qos
2025-07-15 15:50:24.016553+00:00 [warning] <0.764.0> MQTT disconnecting client <<"127.0.0.1:38608 -> 127.0.0.1:27005">> with client ID 'session_upgrade_v3_v5_qos', reason: normal
```

This shows evidence that the MQTT server connection did not process the
DISCONNECT packet. The hypothesis is that the server connection did not
even process the PUBACK packet from the client. Hence, the first message
got requeued and re-delivered to the new v5 client.

This commit fixes this flake by not acking the first message.
Hence, we always expect that the first message will be redelivered to
the new v5 client.

(cherry picked from commit 8307aa6)
[Why]
Sometimes, at least in CI, it looks like the output of the CLI is
prepended with a newline; sometimes it is not. This breaks the check of
that output.

[How]
We just trim the output before parsing it. The parsing already takes
care of trimming internal whitespace.

(cherry picked from commit ffaf919)
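
As an illustration, a sketch of the trimming using OTP's `string:trim/1`; the
`{ok, Output}` return shape of `rabbit_ct_broker_helpers:rabbitmqctl/3`, the
command shown, and `parse_cli_output/1` are assumptions here:

```
%% Assumed shape: the CLI helper returns the raw stdout of the command.
{ok, Output} = rabbit_ct_broker_helpers:rabbitmqctl(Config, 0, ["status"]),
%% Drop the spurious leading newline (and any trailing whitespace) before
%% parsing; the parser already handles whitespace inside the output.
Parsed = parse_cli_output(string:trim(Output)),
```
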
…er/1`

[Why]
Sometimes it returns `false` in CI. `meck:validate/1` can return false
if the module throws an exception. So perhaps this is a timing issue in CI,
where the runner is usually slower than our own machines?

(cherry picked from commit 83b8a6b)
... in several test cases.

[Why]
In CI or any slow and/or busy environment, it may take time for the ETS
tables to get updated.

(cherry picked from commit a44d541)
…luster/1`

[Why]
In CI, we sometimes get a failure when we try to forget node 3.
Unfortunately, the CLI doesn't report the nature of the error.

I suppose it's related to the fact that node 3 is stopped and forgotten
before all three replicas were ready when the stream queue was declared.
This is just a guess though; I have no proof that it is the actual
error.

[How]
We wait for the replicas after declaring the stream queue.

(cherry picked from commit 0fb74ba)
[Why]
If we use the list of reachable nodes, it includes nodes which are
currently booting. Trying to start vhosts while they boot can disturb
their initialization and is very likely to fail anyway.

(cherry picked from commit d154e3d)
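
A minimal sketch of the distinction, assuming the `rabbit_nodes:list_running/0`
and `rabbit_nodes:list_reachable/0` helpers; `start_vhost_on_node/2` is a
hypothetical stand-in for the per-node vhost start performed by the code:

```
%% list_reachable/0 may include nodes which are still booting, whereas
%% list_running/0 only returns nodes that have completed their boot.
Nodes = rabbit_ct_broker_helpers:rpc(Config, 0, rabbit_nodes, list_running, []),
%% Sketch only: start_vhost_on_node/2 is a hypothetical helper for the
%% per-node vhost start.
[ok = start_vhost_on_node(VHost, Node) || Node <- Nodes],
```
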
[Why]
In CI, it fails to fetch dependencies quite frequently, probably due to
some proxy in GitHub Actions.

(cherry picked from commit 4c8835f)
[Why]
This doesn't replicate the common_test logs layout, but it is good
enough to let our GitHub Actions workflow upload the logs without
specific instructions in the workflow.

(cherry picked from commit 8d0f100)
…mit/1`

[Why]
It looks to be too short in CI, causing failures from time to time.

(cherry picked from commit fd4c365)
…t of connections

[Why]
In CI, we sometimes observe two tracked connections in the return value.
I don't know yet what they are. Could it be a client that reopened its
crashed connection and, because stats are updated asynchronously, we see
two tracked connections for a short period of time?

(cherry picked from commit 53d0b14)
…nections

[Why]
In CI, we sometimes observe two tracked connections in the return value.
I don't know yet what they are. Could it be a client that reopened its
crashed connection and, because stats are updated asynchronously, we see
two tracked connections for a short period of time?

(cherry picked from commit ed1cdb5)
…n CI

[Why]
The `stream_pub_sub_metrics` test failed at least once in CI because the
`rabbitmq_stream_consumer_max_offset_lag` was 4 instead of the expected
3 on line 815.

I couldn't reproduce the problem so far.

[How]
The test case now logs the initial value of that metric at the beginning
of the test function. Hopefully this will give us some clue for the day
it fails again.

(cherry picked from commit 2bc8d11)
[Why]
I wonder if a previous test interferes with the metrics verified by this
test case. To be safer, execute it first and see what happens.

(cherry picked from commit 2674456)
…licitly

[Why]
In CI, we observe that the channel hangs sometimes.
rabbitmq_ct_client_helpers implicit connection is quite fragile in the
sense that a test case can disturb the next one in some cases.

[How]
Let's use a dedicated connection and see if it fixes the problem.

(cherry picked from commit 17feaa1)
…ag that never existed

[Why]
The `rabbit_consistent_hash_exchange_raft_based_metadata_store` does not
seem to be a feature flag that ever existed according to the git
history. This causes the test case to always be skipped.

[How]
Simply remove the statement that enables this ghost feature flag.

(cherry picked from commit ea2689f)
[Why]
Maven took ages to fetch dependencies at least once in CI. The testsuite
failed because it reached the time trap limit.

[How]
Increase it from 2 to 5 minutes.

(cherry picked from commit 19ed249)
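
In a common_test suite, such a time trap can be raised at the suite level, for
example as below (a sketch; the actual change may set it per group or per test
case instead):

```
%% Allow up to 5 minutes per test case instead of 2, to accommodate slow
%% Maven dependency downloads in CI.
suite() ->
    [{timetrap, {minutes, 5}}].
```
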
[Why]
The reason is the same as for commit
ffaf919. In fact, it should have been part of that commit; this was an
oversight on my end.

(cherry picked from commit 56b59c3)
… anonymous functions

[Why]
Before this change, when the `idle_time_out_on_server/1` test case was run first in the
shuffled test group, the test module was not loaded on the remote broker.
When the anonymous function was passed to meck and was executed, we got
the following crash on the broker:

    crasher:
      initial call: rabbit_heartbeat:'-heartbeater/2-fun-0-'/0
      pid: <0.704.0>
      registered_name: []
      exception error: {undef,
                           [{#Fun<amqp_client_SUITE.14.116163631>,
                             [#Port<0.45>,[recv_oct]],
                             []},
                            {rabbit_heartbeat,get_sock_stats,3,
                                [{file,"rabbit_heartbeat.erl"},{line,175}]},
                            {rabbit_heartbeat,heartbeater,3,
                                [{file,"rabbit_heartbeat.erl"},{line,155}]},
                            {proc_lib,init_p,3,
                                [{file,"proc_lib.erl"},{line,317}]},
                            {rabbit_net,getstat,[#Port<0.45>,[recv_oct]],[]}]}

This led to a failure of the test case later, when it waited for a
message from the connection.

We do the same in two other test cases where this is likely to happen
too.

[How]
Loading the module first fixes the problem.

(cherry picked from commit bd1978c)
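
A minimal sketch of such a pre-load, assuming `rabbit_ct_broker_helpers:rpc/5`
and plain `code:ensure_loaded/1`; the exact mechanism used by the commit may
differ:

```
%% Make sure this test module is loaded on the remote broker node before meck
%% invokes an anonymous function defined in it; otherwise the fun's module is
%% unknown there and the call crashes with `undef`.
{module, ?MODULE} = rabbit_ct_broker_helpers:rpc(Config, 0,
                                                 code, ensure_loaded, [?MODULE]),
```
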
[Why]
Relying on the return value of the queue deletion is fragile because the
policy is cleared asynchronously.

[How]
We now wait for the queues to reach the expected queue length, then we
delete them and ensure the length didn't change.

(cherry picked from commit efdec84)
[Why]
There is a frequent failure in CI and the fact that all test cases use
the same resource names does not help with debugging.

(cherry picked from commit 5936b3b)
[Why]
This should also help debug the failures we get in CI.

(cherry picked from commit fda663d)
[Why]
It failed at least once in CI. It should help us understand what went
on.

(cherry picked from commit 0a643ef)
... when testing user limits

[How]
This is the same fix as the one for the vhost limits test case made in
commit 5aab965.

While here, fix a compiler warning about an unused variable.

(cherry picked from commit 02b1561)
…se2`

[Why]
The connection is about to be killed at the end of the test case. It's
not necessary to close it explicitly.

Moreover, in a slow environment like CI, the connection process might
have already exited when the test case tries to close it. In this case,
it fails with a `noproc` exception.

(cherry picked from commit 0601ef4)
[Why]
`gen_tcp:close/1` simply closes the connection and doesn't wait for the
broker to handle it. This sometimes causes the next test to fail
because, in addition to that test's new connection, the previous
connection's process is still around, waiting for the broker to notice
the close.

[How]
We now wait for the connection to be closed at the end of a test case,
and wait for the connection list to have a single element when we want
to query the connection name.

(cherry picked from commit eb8f631)
…pic_dest`

[Why]
The `test_topic_dest` test case fails from time to time in CI. I don't
know why, as there are no errors logged anywhere. Let's assume it's a
timeout that is a bit too short.

While here, apply the same change to `test_exchange_dest`.

(cherry picked from commit 5f520b8)
[Why]
I still don't know what causes the transient failures in this testsuite.
The AMQP connection is closed asynchronously, therefore the next test
case is already running when it finishes closing. I have no idea if this
causes trouble, but it makes the broker logs more difficult to read.

(cherry picked from commit 766ca19)
[Why]
I noticed the following error in a test case:

    error sending frame
    Traceback (most recent call last):
      File "/home/runner/work/rabbitmq-server/rabbitmq-server/deps/rabbitmq_stomp/test/python_SUITE_data/src/deps/stomp/transport.py", line 623, in send
        self.socket.sendall(encoded_frame)
    OSError: [Errno 9] Bad file descriptor

When the test suite succeeds, this error is not present; when it fails, it
is. But I checked only one instance of each, which is not enough to draw
any conclusion about the relationship between this error and the later
failing test case.

I have no idea which test case hits this error, so increase the
verbosity in the hope that we see the name of the test case running at
the time of this error.

(cherry picked from commit 5bfb7bc)
(cherry picked from commit 0a5024b)
@dumbbell dumbbell force-pushed the mergify/bp/v4.1.x/pr-14206 branch from c028f44 to 34dac81 Compare August 12, 2025 17:08
@michaelklishin michaelklishin added this to the 4.1.4 milestone Aug 12, 2025
@michaelklishin michaelklishin merged commit c82efd8 into v4.1.x Aug 12, 2025
546 of 547 checks passed
@michaelklishin michaelklishin deleted the mergify/bp/v4.1.x/pr-14206 branch August 12, 2025 19:00