test_perf_profiler fails on aarch64 Fedora Stable Refleaks buildbot #131038

Open
encukou opened this issue Mar 10, 2025 · 15 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage type-bug An unexpected behavior, bug, or error

Comments

@encukou
Member

encukou commented Mar 10, 2025

Bug report

For the past four days or so, test_perf_profiler.test_python_calls_appear_in_the_stack_if_perf_activated has usually been failing on refleaks buildbots with one of these (see build 416 for both):

  • AssertionError: 'py::foo:/tmp/test_python_2wtkitdm/tmpv30razor/perftest.py' not found in <very long string containing frames with a different temp dir instead: /tmp/test_python_ecqze28x/tmpz61vpvo4/perftest.py+0xb here>
  • unexpected stderr:
    Warning:
    Processed 688 events and lost 3 chunks!
    
    Check IO/CPU overload!
    
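For context, a hedged sketch of the machinery under test: the failing assertion searches perf's output for `py::<function>:<file>` labels, which CPython's perf trampoline support (Python 3.12+) writes to `/tmp/perf-<pid>.map` so that perf can symbolize Python frames. The relevant `sys` API, per the CPython docs, looks like this:

```python
import sys

# Hedged sketch of the perf-trampoline API the failing test exercises
# (available since Python 3.12). Activating it makes the interpreter
# emit "py::<function>:<file>" entries to /tmp/perf-<pid>.map -- these
# are exactly the strings the failing assertion searches for.
if hasattr(sys, "is_stack_trampoline_active"):
    active = sys.is_stack_trampoline_active()
    print(f"trampoline active: {active}")  # False in a fresh interpreter
    # sys.activate_stack_trampoline("perf") would turn it on; it raises
    # ValueError on platforms where the trampoline is not supported.
```

The test itself activates the trampoline in a subprocess, runs a script defining `foo`, and then asserts the `py::foo:...` label appears in perf's report.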
@encukou encukou added 3.13 bugs and security fixes performance Performance or resource usage type-bug An unexpected behavior, bug, or error labels Mar 10, 2025
@vstinner
Member

cc @pablogsal @stratakis

@pablogsal
Member

Unfortunately I cannot reproduce this locally, and I have no idea what may be happening on the buildbot. Was it updated recently?

@pablogsal
Member

I cannot reproduce in the buildbot either:

pablo@localhost:~/cpython$ ./python -m test test_perf_profiler -v
== CPython 3.14.0a5+ (heads/main:8431b1fee88, Mar 10 2025, 16:20:11) [GCC 14.2.1 20250110 (Red Hat 14.2.1-7)]
== Linux-6.13.5-200.fc41.aarch64-aarch64-with-glibc2.40 little-endian
== Python build: debug
== cwd: /home/pablo/cpython/build/test_python_worker_160904
== CPU count: 40
== encodings: locale=UTF-8 FS=utf-8
== resources: all test resources are disabled, use -u option to unskip tests

Using random seed: 2682059707
0:00:00 load avg: 16.79 Run 1 test sequentially in a single process
0:00:00 load avg: 16.79 [1/1] test_perf_profiler
test_pre_fork_compile (test.test_perf_profiler.TestPerfProfiler.test_pre_fork_compile) ... ok
test_python_calls_appear_in_the_stack_if_perf_activated (test.test_perf_profiler.TestPerfProfiler.test_python_calls_appear_in_the_stack_if_perf_activated) ... ok
test_python_calls_do_not_appear_in_the_stack_if_perf_deactivated (test.test_perf_profiler.TestPerfProfiler.test_python_calls_do_not_appear_in_the_stack_if_perf_deactivated) ... ok
test_python_calls_appear_in_the_stack_if_perf_activated (test.test_perf_profiler.TestPerfProfilerWithDwarf.test_python_calls_appear_in_the_stack_if_perf_activated) ... ok
test_python_calls_do_not_appear_in_the_stack_if_perf_deactivated (test.test_perf_profiler.TestPerfProfilerWithDwarf.test_python_calls_do_not_appear_in_the_stack_if_perf_deactivated) ... ok
test_sys_api (test.test_perf_profiler.TestPerfTrampoline.test_sys_api) ... ok
test_sys_api_get_status (test.test_perf_profiler.TestPerfTrampoline.test_sys_api_get_status) ... ok
test_sys_api_with_existing_trampoline (test.test_perf_profiler.TestPerfTrampoline.test_sys_api_with_existing_trampoline) ... ok
test_sys_api_with_invalid_trampoline (test.test_perf_profiler.TestPerfTrampoline.test_sys_api_with_invalid_trampoline) ... ok
test_trampoline_works (test.test_perf_profiler.TestPerfTrampoline.test_trampoline_works) ... ok
test_trampoline_works_with_forks (test.test_perf_profiler.TestPerfTrampoline.test_trampoline_works_with_forks) ... ok

----------------------------------------------------------------------
Ran 11 tests in 19.774s

OK
0:00:20 load avg: 24.35 [1/1] test_perf_profiler passed

== Tests result: SUCCESS ==

1 test OK.

Total duration: 20.9 sec
Total tests: run=11
Total test files: run=1/1
Result: SUCCESS
pablo@localhost:~/cpython$ uname -a
Linux localhost.localdomain 6.13.5-200.fc41.aarch64 #1 SMP PREEMPT_DYNAMIC Thu Feb 27 15:30:52 UTC 2025 aarch64 GNU/Linux

It seems to be related to perf dropping events under high CPU usage. I honestly don't want to skip the test in this case, because that would be a pain if we ever have real failures.

@pablogsal
Member

@stratakis Can you investigate if something has changed recently in the buildbot?

@pablogsal
Member

pablogsal commented Mar 10, 2025

I have been testing under high load using this script:

import multiprocessing
import time
import argparse

def cpu_load(duration=None, intensity=1.0):
    """
    Generate CPU load on a single core.
    
    Args:
        duration (float, optional): Duration in seconds to run the load.
                                   If None, runs until interrupted.
        intensity (float): Value between 0.0 and 1.0 to control load intensity.
                          1.0 means 100% load, 0.5 means approximately 50% load.
    """
    start_time = time.time()
    
    while True:
        # Busy work
        x = 0
        # Adjust end range to control intensity
        end = int(10000000 * intensity)
        for i in range(end):
            x += i
            x *= 1.0000001
        
        # Sleep to reduce CPU usage if intensity < 1.0
        if intensity < 1.0:
            time.sleep((1.0 - intensity) * 0.1)
            
        # Check if duration has been reached
        if duration is not None and time.time() - start_time >= duration:
            break

def run_load_test(cores=None, duration=None, intensity=1.0):
    """
    Run CPU load test on multiple cores.
    
    Args:
        cores (int, optional): Number of cores to utilize. 
                              If None, uses all available cores.
        duration (float, optional): Duration in seconds to run the test.
                                   If None, runs until interrupted.
        intensity (float): Value between 0.0 and 1.0 to control load intensity.
    """
    if cores is None:
        cores = multiprocessing.cpu_count()
    
    print(f"Starting CPU load test on {cores} cores with {intensity*100}% intensity")
    if duration:
        print(f"Test will run for {duration} seconds")
    else:
        print("Test will run until interrupted with Ctrl+C")
    
    # Create and start processes
    processes = []
    try:
        for _ in range(cores):
            p = multiprocessing.Process(target=cpu_load, args=(duration, intensity))
            processes.append(p)
            p.start()
        
        # Wait for processes to complete if duration is specified
        if duration:
            for p in processes:
                p.join()
            print("CPU load test completed")
        else:
            # Keep the main process running until interrupted
            while True:
                time.sleep(1)
                
    except KeyboardInterrupt:
        print("Stopping CPU load test...")
        for p in processes:
            if p.is_alive():
                p.terminate()
        print("CPU load test stopped")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Generate CPU load for testing")
    parser.add_argument("-c", "--cores", type=int, default=None, 
                        help="Number of cores to utilize (default: all available)")
    parser.add_argument("-d", "--duration", type=float, default=None,
                        help="Duration in seconds (default: run until interrupted)")
    parser.add_argument("-i", "--intensity", type=float, default=1.0,
                        help="Load intensity from 0.0 to 1.0 (default: 1.0)")
    
    args = parser.parse_args()
    
    run_load_test(args.cores, args.duration, args.intensity)

and I still can't reproduce it, even on the buildbot.

@pablogsal
Member

I'm testing with the buildbot at 147.75.54.63

@stratakis
Contributor

The failures coincide with the dnf update I did a couple of days ago.

More specifically, the kernel (and hence perf as well) was updated from 6.12.11-200 to 6.13.5-200.

update.txt

I'm attaching the full dnf transaction.

@stratakis
Contributor

However, last week I was running some Python compilations of my own on the buildbot to test something, and if those ran at the same time as a refleak build, that could be the problem. I'd leave this open until a couple more refleak builds occur; if nothing shows up, we can close the issue.

@encukou
Member Author

encukou commented Mar 11, 2025

That was probably it. The refleak buildbots were green for a while, before starting to fail with an unrelated change.
I'll confirm tomorrow.

@encukou
Member Author

encukou commented Mar 13, 2025

Yup.
Sorry for the noise!

@encukou encukou closed this as not planned (won't fix, can't repro, duplicate, stale) Mar 13, 2025
@picnixz
Member

picnixz commented Mar 23, 2025

The buildbots are still failing, even though they aren't really "red": most of the time they're yellow, but the test is indeed failing:

[Image: screenshot of the buildbot build history]

The branch is 3.14 and the rightmost red is 10 days ago. We're seeing more flaky failures than passes, so I'm not really sure there isn't something else going on.

@picnixz
Member

picnixz commented Mar 29, 2025

I'm re-opening this because we're still seeing failures (and it is flagged by https://buildbot.python.org/#/release_status). At first glance, the failures are becoming more and more frequent.

@picnixz picnixz reopened this Mar 29, 2025
@picnixz picnixz changed the title 3.13: test_perf_profiler fails on aarch64 Fedora Stable Refleaks buildbot test_perf_profiler fails on aarch64 Fedora Stable Refleaks buildbot Mar 29, 2025
@picnixz picnixz added interpreter-core (Objects, Python, Grammar, and Parser dirs) and removed 3.13 bugs and security fixes labels Mar 29, 2025
@pablogsal
Member

I have tried another reproducer round and still could not reproduce it.

@picnixz
Member

picnixz commented Mar 29, 2025

This is really strange. I don't know what's happening, and I'm not sure whether we should worry about it. @hugovk: is this considered a deal breaker for the next release, given that it's a tier 2 bot that is failing?

As Petr observed, the strings to find vs those that are actually here mismatch a bit. For instance, in the latest failure, the string to find is

py::foo:/tmp/test_python_gjd_5u20/tmpf3ej7o5w/perftest.py

On the other hand, we see the following string (quite a lot of the time):

py::foo:/tmp/test_python_g13mifgx/tmp4sgs6ndf/perftest.py

So, it seems that the temporary folder is not the correct one. I don't know what's happening nor exactly what's being tested, but the fact that we get an almost-correct string could point to a parallelism issue. Maybe another test is polluting /tmp for some reason?
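As a hedged illustration of why the almost-matching path is suspicious: each run creates its own unique temporary directory, so a `perftest.py` path under a *different* `test_python_*` prefix can only come from another run's samples (stale perf data or a concurrent worker), never from a naming collision within a single run:

```python
import shutil
import tempfile

# Two runs each creating their own temp dir (same prefix style the test
# suite uses). mkdtemp guarantees unique names, so a perf sample whose
# path points at a different test_python_* dir must originate from a
# different process or run, not from this one.
run_a = tempfile.mkdtemp(prefix="test_python_")
run_b = tempfile.mkdtemp(prefix="test_python_")
print(run_a != run_b)  # True: the directories can never collide
shutil.rmtree(run_a)
shutil.rmtree(run_b)
```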

@hugovk
Member

hugovk commented Mar 29, 2025

I don't know what's happening, and I'm not sure whether we should worry about it. @hugovk: is this considered a deal breaker for the next release, given that it's a tier 2 bot that is failing?

Hmm, a flaky test won't necessarily block (the last) alpha release. (I'd prefer it fixed or skipped or otherwise green :)
