
os.cpu_count is problematic on sparc/solaris #73451


Closed
phantal mannequin opened this issue Jan 13, 2017 · 13 comments
Labels
OS-unsupported stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments


phantal mannequin commented Jan 13, 2017

BPO 29265
Nosy @vstinner, @bitdancer

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2017-01-13.17:48:34.324>
labels = ['type-bug', 'library']
title = 'os.cpu_count is problematic on sparc/solaris'
updated_at = <Date 2017-01-13.22:28:44.539>
user = 'https://bugs.python.org/phantal'

bugs.python.org fields:

activity = <Date 2017-01-13.22:28:44.539>
actor = 'phantal'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation = <Date 2017-01-13.17:48:34.324>
creator = 'phantal'
dependencies = []
files = []
hgrepos = []
issue_num = 29265
keywords = []
message_count = 5.0
messages = ['285430', '285431', '285432', '285441', '285446']
nosy_count = 3.0
nosy_names = ['vstinner', 'r.david.murray', 'phantal']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue29265'
versions = ['Python 3.6']


phantal mannequin commented Jan 13, 2017

I'm attempting to build Python 3.6.0 on sparc/solaris 10. After the initial configure/compile completed, I ran "make test" and saw:

$ make test
running build
running build_ext
(...)
running build_scripts
copying and adjusting (...)
changing mode of (...)
renaming (...)
(...)
Run tests in parallel using 258 child processes

I'm fairly sure the issue stems from the fact that each core on the machine has 8 "threads" and there are 32 cores (for a total of 256 virtual cores).

Each core can execute 8 parallel tasks only in very specific circumstances. It's intended for use by things like lapack/atlas where you might be doing many computations on the same set of data.

Outside of these more restricted circumstances each core can only handle 2 parallel tasks (or so I gathered from the documentation), so at best this machine could handle 64 backgrounded jobs, though I normally restrict my builds to the actual core count or less.

The most common way to get a "realistic" core count on these machines from shell scripts is:

$ core_count=`kstat -m cpu_info | grep core_id | sort -u | wc -l`

... though I'm not sure how the test suite is determining the core count. I didn't see any mention of "kstat" anywhere.
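For reference, a rough Python equivalent of that kstat pipeline could look like the sketch below. Relying on kstat(1M) being on PATH and on its output format are Solaris-specific assumptions, and this is only an illustration, not how the test suite counts CPUs.

# Sketch: count distinct core_id values reported by `kstat -m cpu_info`.
import subprocess

def solaris_physical_core_count():
    out = subprocess.run(["kstat", "-m", "cpu_info"],
                         capture_output=True, text=True, check=True).stdout
    # Each physical core is listed once per hardware thread; dedupe on core_id.
    core_ids = {line.split()[-1] for line in out.splitlines() if "core_id" in line}
    return len(core_ids)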

@phantal phantal mannequin added tests Tests in the Lib/test dir type-bug An unexpected behavior, bug, or error labels Jan 13, 2017

phantal mannequin commented Jan 13, 2017

I forgot to mention, this wasn't an issue in 3.5.1 though I never did check how many jobs it was using.

I ran into other issues building that version and moved to a newer version because at least one of them (logging test race condition) was fixed after 3.5.1.


phantal mannequin commented Jan 13, 2017

This is odd. I just went back and re-ran 3.5.1 to see how many cores it uses, and it's having the same problem now. So, scratch that last comment.

bitdancer commented Jan 13, 2017

You don't know it, but you are actually reporting a possible sparc/Solaris specific (I think) bug against os.cpu_count. There was some discussion about how cpu_count might be problematic in this regard.

It doesn't cause any real problem with the tests, though. I routinely run with -j40 on my 2 cpu test box because the test run completes faster that way due to the way many tests spend time waiting for various things.

@bitdancer bitdancer added stdlib Python modules in the Lib dir and removed tests Tests in the Lib/test dir labels Jan 13, 2017
@bitdancer bitdancer changed the title test suite is attempting to spawn 258 child processes to run tests os.cpu_count is problematic on sparc/solaris Jan 13, 2017

phantal mannequin commented Jan 13, 2017

It doesn't cause any real problem with the tests, though. I routinely run with -j40 on my 2 cpu test box because the test run completes faster that way due to the way many tests spend time waiting for various things.

In my case it did cause a real problem, because it allocated enough file descriptors to hit the cap on max open file handles.


kulikjak commented Dec 6, 2023

As per the current documentation, os.cpu_count() should return the "number of logical CPUs in the system", so returning 256 on a 32-core machine with 8 threads per core is correct.

We previously hit the cap on max open file handles as well, and solved it by running the test suite with ulimit -n 16384.
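For anyone who prefers to do this from inside the process rather than the shell, a rough in-process equivalent of ulimit -n 16384 is sketched below; it only uses the standard resource module, and the 16384 value is just the one mentioned above.

# Sketch: raise the soft RLIMIT_NOFILE towards 16384, capped at the hard limit.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
wanted = 16384
if hard != resource.RLIM_INFINITY:
    wanted = min(wanted, hard)
if wanted > soft:
    resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))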


vstinner commented Dec 6, 2023

os.cpu_count is problematic on sparc/solaris

The issue title is misleading. os.cpu_count() is correct. The problem is rather that the default file descriptor limit is too low for a machine with 256 logical CPUs ("threads") when running ./python -m test -j0, which spawns os.process_cpu_count() + 2 worker processes.

You can specify the number of worker processes: ./python -m test -j20, or you can increase the file descriptor limit.

Is there anything to do on the Python side? Should regrtest limit the number of worker processes based on the current file descriptor limit? I have no idea how to compute a number of processes based on a maximum number of file descriptors.
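For clarity, the -j0 default described above can be sketched roughly as follows. This is a simplified illustration, not the actual regrtest code; it assumes os.process_cpu_count() is available (recent Pythons only) and falls back to os.cpu_count() otherwise.

# Sketch: default number of regrtest worker processes for -j0, as described above.
import os

def default_worker_count():
    count_cpus = getattr(os, "process_cpu_count", os.cpu_count)
    return (count_cpus() or 1) + 2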


kulikjak commented Dec 6, 2023

Each core can execute 8 parallel tasks only in very specific circumstances. It's intended for use by things like lapack/atlas where you might be doing many computations on the same set of data.

Reading it again, I see what you mean - SPARC CPUs are not able to execute all those 8 threads at the same time. That said, Intel hyperthreaded cores are also presented as 2 logical cores, and they cannot do certain operations simultaneously (although SPARC is definitely more restricted in this).

Is there anything to do on the Python side? Should regrtest limit the number of worker processes based on the current file descriptor limit? I have no idea how to compute a number of processes based on a maximum number of file descriptors.

That would be very hard to do, considering that the tests themselves can open additional files. I don't think it is really feasible, given that simply adjusting the ulimit or running the test suite with -jX already solves it (though the OP might see it differently).


vstinner commented Dec 6, 2023

The bare minimum that can easily be done is to emit a warning if the FD limit looks too small compared to the -jN argument, maybe with instructions on how to fix the issue (reduce the -jN value or increase the FD limit).


kulikjak commented Dec 6, 2023

That is true. In our case, IIRC, we saw occasional issues when building with an fd limit of 4096 and 514 worker processes. So I'd propose warning the user when the fd limit is less than 10 times the worker count.
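A minimal sketch of such a warning, assuming the 10x heuristic above (the function name and the message wording are made up, not regrtest internals):

# Sketch: warn when the soft FD limit looks too small for the worker count.
import resource

def warn_if_fd_limit_low(num_workers, factor=10):
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft != resource.RLIM_INFINITY and soft < num_workers * factor:
        print(f"Warning: soft RLIMIT_NOFILE ({soft}) may be too low for "
              f"{num_workers} workers; reduce -jN or raise the limit, "
              f"e.g. ulimit -n {num_workers * factor}.")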


vstinner commented Dec 6, 2023

On Linux, the default FD limit is 1024, but I never hit such an issue. Well, my laptop "only" has 12 logical CPUs. Which process is impacted by the issue? The main process which spawns and manages the 514 worker processes? Does Python even need a single FD per worker process?


kulikjak commented Dec 20, 2023

Sorry for such a long delay.

I looked into this, and apparently it wasn't the test suite but the PGO phase where this was an issue (back when PGO ran the entire test suite). Also, it seems that the default limit on our build machines might have been lower back then.

Anyway, when I lower the limit all the way to 512 and run the test suite, I see the following error:

Warning -- regrtest worker thread failed: Traceback (most recent call last):
  File "/builds/python37/Python-3.7.17/Lib/test/libregrtest/runtest_mp.py", line 284, in run
    mp_result = self._runtest(test_name)
  File "/builds/python37/Python-3.7.17/Lib/test/libregrtest/runtest_mp.py", line 249, in _runtest
    retcode, stdout, stderr = self._run_process(test_name)
  File "/builds/python37/Python-3.7.17/Lib/test/libregrtest/runtest_mp.py", line 194, in _run_process
    popen = run_test_in_subprocess(test_name, self.ns)
  File "/builds/python37/Python-3.7.17/Lib/test/libregrtest/runtest_mp.py", line 74, in run_test_in_subprocess
    **kw)
  File "/builds/python37/Python-3.7.17/Lib/subprocess.py", line 800, in __init__
    restore_signals, start_new_session)
  File "/builds/python37/Python-3.7.17/Lib/subprocess.py", line 1457, in _execute_child
    errpipe_read, errpipe_write = os.pipe()
OSError: [Errno 24] Too many open files

Kill <TestWorkerProcess #1 running test=test_support pid=44123 time=8.7 sec> process group
Kill <TestWorkerProcess #2 running test=test_fork1 pid=44801 time=4.4 sec> process group
...
Kill <TestWorkerProcess #175 running test=test_pickletools pid=45171 time=2.3 sec> process group
0:00:14 load avg: 33.55 Waiting for <TestWorkerProcess #95 running test=test_posix pid=45022 time=4.5 sec> thread for 1.1 sec

== Tests result: FAILURE ==

which looks like the main process, but I guess that e.g. subprocess tests might fail as well?

But even with a limit of 512 it doesn't always happen (I guess some tests finish very quickly and free the resources before the limit is reached).
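As a standalone illustration (not taken from the test suite), the same EMFILE failure can be reproduced by lowering the soft limit and opening pipes until os.pipe() fails, roughly:

# Sketch: reproduce "Too many open files" under an artificially low FD limit.
import os
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (64, hard))

fds = []
try:
    while True:
        fds.extend(os.pipe())          # two new descriptors per call
except OSError as exc:
    print(f"failed after {len(fds)} descriptors: {exc}")  # expect Errno 24
finally:
    for fd in fds:
        os.close(fd)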


encukou commented Sep 17, 2024

SPARC/Solaris is unsupported per PEP-11.

I'll close the bug.
Like other issues for unsupported platforms, it'll be listed on the GitHub project for anyone who wants to support the platform unofficially or bring support back.
I encourage such people to add themselves to the Experts list so we know whom to ping.

@encukou encukou closed this as not planned Sep 17, 2024