Collection of test fixes (2025Q2, batch 2) #14310

dumbbell · 2025-07-30T14:43:24Z

This pull request addresses the test flakes that appeared in the past couple months. This is a follow-up to #14206 for failures that were not detected as part of the first pull request.

[Why] If we use the list of reachable nodes, it includes nodes which are currently booting. Trying to start vhost during their start can disturb their initialization and has a great chance to fail anyway.

[Why] In CI, it fails to fetch dependencies quite frequently, probably due to some proxy in GitHub Actions.

[Why] This doesn't replicate the common_test logs layout, but it will be good enough to let our GitHub Actions workflow to upload the logs without specific instructions in the workflow.

…mit/1` [Why] It looks to be too short in CI, causing failures from time to time.

…t of connections [Why] In CI, we sometimes observe two tracked connections in the return value. I don't know yet what they are. Could it be a client that reopened its crashed connection and because stats are updated asynchronously, we get two tracked connections for a short period of time?

…nections [Why] In CI, we sometimes observe two tracked connections in the return value. I don't know yet what they are. Could it be a client that reopened its crashed connection and because stats are updated asynchronously, we get two tracked connections for a short period of time?

[Why] In CI, we observed failures where the sender runs out of credits and don't expect that. [How] The `amqp_utils:send_messages/3` function already takes care of that. Move this logic to a `send_message/2` function and use it in `send_messages/3` and prevriously direct uses of `amqp10_client:send_msg/2`.

…n CI [Why] The `stream_pub_sub_metrics` test failed at least once in CI because the `rabbitmq_stream_consumer_max_offset_lag` was 4 instead of the expected 3 on line 815. I couldn't reproduce the problem so far. [How] The test case now logs the initial value of that metric at the beginning of the test function. Hopefully this will give us some clue for the day it fails again.

[Why] I wonder if a previous test interferes with the metrics verified by this test case. To be safer, execute it first and let's see what happens.

…licitly [Why] In CI, we observe that the channel hangs sometimes. rabbitmq_ct_client_helpers implicit connection is quite fragile in the sense that a test case can disturb the next one in some cases. [How] Let's use a dedicated connection and see if it fixes the problem.

…ag that never existed [Why] The `rabbit_consistent_hash_exchange_raft_based_metadata_store` does not seem to be a feature flag that ever existed according to the git history. This causes the test case to always be skipped. [How] Simply remove the statement that enables this ghost feature flag.

[Why] Maven took ages to fetch dependencies at least once in CI. The testsuite failed because it reached the time trap limit. [How] Increase it from 2 to 5 minutes.

[Why] It didn't handle them before and crashed later when it assumed the return value was a list.

[Why] The reason is the same as for commit ffaf919. It should have been part of it in fact, so an oversight from my end.

… anonymous functions [Why] Before this change, when the `idle_time_out_on_server/1` test case was runned first in the shuffled test group, the test module was not loaded on the remote broker. When the anonymous function was passed to meck and was executed, we got the following crash on the broker: crasher: initial call: rabbit_heartbeat:'-heartbeater/2-fun-0-'/0 pid: <0.704.0> registered_name: [] exception error: {undef, [{#Fun<amqp_client_SUITE.14.116163631>, [#Port<0.45>,[recv_oct]], []}, {rabbit_heartbeat,get_sock_stats,3, [{file,"rabbit_heartbeat.erl"},{line,175}]}, {rabbit_heartbeat,heartbeater,3, [{file,"rabbit_heartbeat.erl"},{line,155}]}, {proc_lib,init_p,3, [{file,"proc_lib.erl"},{line,317}]}, {rabbit_net,getstat,[#Port<0.45>,[recv_oct]],[]}]} This led to a failure of the test case later, when it waited for a message from the connecrtion. We do the same in two other test cases where this is likely to happen too. [How] Loading the module first fixes the problem.

[Why] Relying on the return value of the queue deletion is fragile because the policy is cleared asynchronously. [How] We now wait for the queues to reach the expected queue length, then we delete them and ensure the length didn't change.

[Why] There is a frequent failure in CI and the fact that all test cases use the same resource names does not help with debugging.

[Why] This should also help debug the failures we get in CI.

[Why] It failed at least once in CI. It should help us understand what went on.

[Why] It didn't handle them before and crashed later when it assumed the return value was a list.

... when testing user limits [How] This is the same fix as the one for the vhost limits test case made in commit 5aab965. While here, fix a compiler warning about an unused variable.

…se2` [Why] The connection is about to be killed at the end of the test case. It's not necessary to close it explicitly. Moreover, on a slow environment like CI, the connection process might have already exited when the test case tries to close it. In this case, it fails with a `noproc` exception.

[Why] `gen_tcp:close/1` simply closes the connection and doesn't wait for the broker to handle it. This sometimes causes the next test to fail because, in addition to that test's new connection, there is still the previous one's process still around waiting for the broker to notice the close. [How] We now wait for the connection to be closed at the end of a test case, and wait for the connection list to have a single element when we want to query the connnection name.

[Why] It didn't handle them before and crashed later when it assumed the return value was a list.

…pic_dest` [Why] The `test_topic_dest` test case fails from time to time in CI. I don't know why as there are no errors logged anywhere. Let's assume it's a timeout a bit too short. While here, apply the same change to `test_exchange_dest`.

[Why] I still don't know what causes the transient failures in this testsuite. The AMQP connection is closed asynchronously, therefore the next test case is running when it finishes to close. I have no idea if it causes troubles, but it makes the broker logs more difficult to read.

[Why] I noticed the following error in a test case: error sending frame Traceback (most recent call last): File "/home/runner/work/rabbitmq-server/rabbitmq-server/deps/rabbitmq_stomp/test/python_SUITE_data/src/deps/stomp/transport.py", line 623, in send self.socket.sendall(encoded_frame) OSError: [Errno 9] Bad file descriptor When the test suite succeeds, this error is not present. When it failed, it was present. But I checked only one instance of each, it's not enough to draw any conclusion about the relationship between this error and the failing test case later. I have no idea which test case hits this error, so increase the verbosity, in the hope we see the name of the test case running at the time of this error.

mkuratczyk · 2025-08-12T14:08:17Z

@Mergifyio backport v4.1.x

mergify · 2025-08-12T14:08:28Z

backport v4.1.x

✅ Backports have been created

#14371 Collection of test fixes (2025Q2, batch 2) (backport #14310) has been created for branch v4.1.x

dumbbell self-assigned this Jul 30, 2025

dumbbell force-pushed the fix-test-flakes-2025Q2 branch 6 times, most recently from 17a7f75 to e66cc6d Compare August 4, 2025 17:46

mergify bot added the make label Aug 4, 2025

dumbbell force-pushed the fix-test-flakes-2025Q2 branch 11 times, most recently from af91d61 to d3ad756 Compare August 8, 2025 08:04

dumbbell added 11 commits August 8, 2025 10:12

rabbit_vhosts: Only reconcile vhost procs on nodes running RabbitMQ

d154e3d

[Why] If we use the list of reachable nodes, it includes nodes which are currently booting. Trying to start vhost during their start can disturb their initialization and has a great chance to fail anyway.

rabbitmq_stream_management: Tell Maven to retry when fetching deps

4c8835f

[Why] In CI, it fails to fetch dependencies quite frequently, probably due to some proxy in GitHub Actions.

rabbitmq_cli: Create a symlink to test node's logs

8d0f100

[Why] This doesn't replicate the common_test logs layout, but it will be good enough to let our GitHub Actions workflow to upload the logs without specific instructions in the workflow.

rabbit_fifo_dlx_integration_SUITE: Increase a timeout in `delivery_li…

fd4c365

…mit/1` [Why] It looks to be too short in CI, causing failures from time to time.

rabbit_prometheus_http_SUITE: Run stream_pub_sub_metrics first

2674456

[Why] I wonder if a previous test interferes with the metrics verified by this test case. To be safer, execute it first and let's see what happens.

rabbit_exchange_type_consistent_hash_SUITE: Set timetrap to 5 minutes

832d701

dumbbell added 20 commits August 8, 2025 10:12

rabbit_stream_partitions_SUITE: Fix incorrect log message

73c663e

amqp_jms_SUITE: Increase time trap

19ed249

[Why] Maven took ages to fetch dependencies at least once in CI. The testsuite failed because it reached the time trap limit. [How] Increase it from 2 to 5 minutes.

mqtt_shared_SUITE: Handle error returned by rabbit_ct_broker_helpers

8bdbb0f

[Why] It didn't handle them before and crashed later when it assumed the return value was a list.

amqp_client_SUITE: Trim "list_connections" output in one more place

56b59c3

[Why] The reason is the same as for commit ffaf919. It should have been part of it in fact, so an oversight from my end.

amqp10_inter_cluster_SUITE: Use per-test shovel names

5936b3b

[Why] There is a frequent failure in CI and the fact that all test cases use the same resource names does not help with debugging.

amqp10_inter_cluster_SUITE: Log messages and queues length

fda663d

[Why] This should also help debug the failures we get in CI.

feature_flags_v2_SUITE: Catch and log return value of peer:stop/1

0a643ef

[Why] It failed at least once in CI. It should help us understand what went on.

auth_SUITE: Handle error returned by rabbit_ct_broker_helpers

5c1456b

[Why] It didn't handle them before and crashed later when it assumed the return value was a list.

auth_SUITE: Wait for connection tracking to be up-to-date

02b1561

... when testing user limits [How] This is the same fix as the one for the vhost limits test case made in commit 5aab965. While here, fix a compiler warning about an unused variable.

cluster_SUITE: Handle error returned by rabbit_ct_broker_helpers

0e36184

[Why] It didn't handle them before and crashed later when it assumed the return value was a list.

python_SUITE: Bump Python dependencies to their latest versions

350bda1

python_SUITE: Add more debug messages

0a5024b

dumbbell force-pushed the fix-test-flakes-2025Q2 branch from d3ad756 to 0a5024b Compare August 8, 2025 08:33

dumbbell marked this pull request as ready for review August 8, 2025 08:56

dumbbell merged commit 2883962 into main Aug 8, 2025
283 checks passed

dumbbell deleted the fix-test-flakes-2025Q2 branch August 8, 2025 08:56

mkuratczyk added the backport-v4.1.x label Aug 12, 2025

mergify bot mentioned this pull request Aug 12, 2025

Collection of test fixes (2025Q2, batch 2) (backport #14310) #14371

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Collection of test fixes (2025Q2, batch 2) #14310

Collection of test fixes (2025Q2, batch 2) #14310

Uh oh!

dumbbell commented Jul 30, 2025

Uh oh!

Uh oh!

mkuratczyk commented Aug 12, 2025

Uh oh!

mergify bot commented Aug 12, 2025 •

edited

Loading

Uh oh!

Uh oh!

Collection of test fixes (2025Q2, batch 2) #14310

Collection of test fixes (2025Q2, batch 2) #14310

Uh oh!

Conversation

dumbbell commented Jul 30, 2025

Uh oh!

Uh oh!

mkuratczyk commented Aug 12, 2025

Uh oh!

mergify bot commented Aug 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ Backports have been created

Uh oh!

Uh oh!

mergify bot commented Aug 12, 2025 •

edited

Loading