Deadlock hang when using asyncio await on sklearn RandomForestClassifier #1605
Those are very old versions of scikit-learn and joblib. Those problems might have been fixed in the meantime, but it's impossible to tell without a reproducer. If you switch to the latest stable versions, you should be able to inspect which backend is actually used by setting the verbose setting to a large value, e.g. 10 or more:

from joblib import parallel_config
parallel_config(verbose=10)

On older versions, it might be possible to get verbose output by setting the verbose parameter of the estimator instead.

I find it surprising that you suspect that it's using process-based parallelism. Let's consider the following:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from joblib import dump
iris = load_iris()
X, y = iris.data, iris.target
rf = RandomForestClassifier().fit(X, y)
dump(rf, 'rf.joblib')
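
# In a second snippet, reload the model and run prediction: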
from sklearn.datasets import load_iris
from joblib import load
iris = load_iris()
rf = load('rf.joblib')
rf.set_params(n_jobs=2, verbose=10)
rf.predict_proba(iris.data)

When I execute the above code, the verbose output shows that thread-based parallelism is used, as expected.
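For completeness, here is a minimal sketch of pinning the backend explicitly around the prediction call with joblib's parallel_backend context manager. This is not needed for the snippet above (the verbose output already confirms threading), but it removes backend selection as a variable when debugging:

from joblib import load, parallel_backend
from sklearn.datasets import load_iris

iris = load_iris()
rf = load('rf.joblib')
rf.set_params(n_jobs=2, verbose=10)

# Force thread-based parallelism for anything joblib runs inside this block;
# the verbose output should confirm which backend is actually active.
with parallel_backend('threading', n_jobs=2):
    rf.predict_proba(iris.data)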
To get more debugging info, you can also use faulthandler in your original code to get a full traceback after a timeout (in case of deadlock). For instance, if you expect your prediction code to always last less than 10 s, you can wrap the prediction code as follows:

from sklearn.datasets import load_iris
from joblib import load
from faulthandler import dump_traceback_later, cancel_dump_traceback_later
iris = load_iris()
rf = load('rf.joblib')
rf.set_params(n_jobs=2, verbose=10)
dump_traceback_later(timeout=10, exit=True)
rf.predict_proba(iris.data)
cancel_dump_traceback_later()

and report the results here. See the official documentation for details on faulthandler usage.
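If the hang is too rare to choose a sensible timeout, an alternative sketch (assuming a Unix platform) is to register a signal handler so the tracebacks can be dumped on demand once the process looks stuck:

import faulthandler
import signal

# Dump the Python traceback of every thread when the process receives SIGUSR1,
# e.g. by running `kill -USR1 <pid>` from another shell while it is hung.
faulthandler.register(signal.SIGUSR1, all_threads=True)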
Ideally, please provide a minimal reproducer along the lines of the code snippets I posted in the above comment, but also adding an asyncio await pattern similar to what you do in your setup.
Thanks so much! For sure, I can share code. I came up with this script to minimally reproduce how we are calling the model in our application. I replaced the production code with some testing code for ease of testing, but I included it (commented out) for completeness of the conversation.

import asyncio
from asyncio.events import AbstractEventLoop
from typing import BinaryIO
import joblib
import numpy as np
from past.utils import old_div
from sklearn.ensemble import RandomForestClassifier
import numpy.typing as npt
model_path = "/biz/baz/my_model.pkl"
def load_pickle(fname):
with open(fname, "rb") as fd:
obj = joblib.load(fd) # Based on your example, this is not necessary.
return obj
model_backend: RandomForestClassifier = load_pickle(model_path)
model_backend.n_jobs = 1 # Based on your example, I should be using `set_params` instead
clf_mean: npt.ArrayLike = model_backend.mean
clf_std: npt.ArrayLike = model_backend.std
model_backend.set_params(verbose=10)
async def predict(features: npt.ArrayLike) -> npt.ArrayLike:
m_feat = features - clf_mean.reshape(1, -1)
normalized_features = old_div(m_feat, clf_std.reshape(1, -1))
prob = model_backend.predict_proba(normalized_features)
return prob
NUM_TASKS = 15 # arbitrary
FEATURES_INPUT = np.random.rand(*(7_500, 226)) # This is the max we will see, but the hang can happen for a much smaller input size.
_semaphore = asyncio.Semaphore(5)
async def rate_limited_req():
async with _semaphore:
# Several other `await` calls to convert `req` into features...
ans = await predict(FEATURES_INPUT)
# Write `ans` back to a response Named Pipe
# def production_call_pattern(loop: AbstractEventLoop):
# # In our system, we actually invoke the model by responding to file writes:
# def _execute(req_pipe: BinaryIO):
# req = req_pipe.readline()
# asyncio.ensure_future(rate_limited_req(), loop)
# request_pipe = open("/foo/bar.0", "rb", buffering=0)
# loop.add_reader(request_pipe, _execute, request_pipe)
# loop.run_forever()
def testing_call_pattern(loop: AbstractEventLoop):
async def testing_create_tasks():
predict_tasks = [loop.create_task(rate_limited_req()) for _ in range(NUM_TASKS)]
[await task for task in predict_tasks]
loop.run_until_complete(testing_create_tasks())
def main():
loop = asyncio.get_event_loop()
# production_call_pattern(loop)
testing_call_pattern(loop)
main()

Running this prints the verbose output many times.
Please let me know if you have any questions!
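One experiment worth sketching here (illustrative only, not verified against this reproducer): because predict_proba is a blocking call, it can be pushed off the event loop thread with run_in_executor, so the asyncio loop itself never blocks inside sklearn/joblib. predict_off_loop is a hypothetical name, model_backend is the classifier loaded in the script above, and the normalization step is omitted for brevity:

import asyncio
from concurrent.futures import ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=1)

async def predict_off_loop(features):
    # Run the blocking sklearn call in a worker thread instead of on the
    # event loop thread; the coroutine suspends until the result is ready.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(
        _executor, model_backend.predict_proba, features
    )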
@ogrisel I saw the issue reproduce and got the full stack trace for it! From what I can tell, we are stuck, and what the docs say about this concerns me.

EDIT: Still testing possible solutions.
Hey @ogrisel, we saw this happen, and luckily the faulthandler suggestion gave us a traceback to work from. The traceback pointed at line 915 (commit 5991350).

This led me to an important discovery. As mentioned since 2009 in python/cpython#50970 (comment), it is known that if you fork a process while another thread holds a lock, the child can deadlock waiting on a lock that will never be released. This makes sense for our case, where 2 Python processes are both making these calls. I found Delgan/loguru#231 (comment), which explains the issue really well and implemented a fix in Delgan/loguru@8c42511.

If a fix like this interests the joblib maintainers, I'm happy to help.
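For context, here is a minimal sketch of the general at-fork locking pattern that fixes of this kind typically rely on (illustrative only, not the actual loguru or joblib code): take the lock right before fork so no other thread can hold it at that moment, then release it in both the parent and the child:

import os
import threading

# Hypothetical module-level lock guarding a shared resource (e.g. a log sink).
_lock = threading.Lock()

def _acquire_before_fork():
    # Ensure no other thread holds the lock at the moment of fork.
    _lock.acquire()

def _release_after_fork():
    # Release it again in both the parent and the freshly forked child.
    _lock.release()

os.register_at_fork(
    before=_acquire_before_fork,
    after_in_parent=_release_after_fork,
    after_in_child=_release_after_fork,
)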
EDIT: We confirmed in production that disabling the verbose output avoids the hang. Reading KimiNewt/pyshark#26 pointed us in the same direction. We don't even care about the verbose output, so we simply set:

model_backend.verbose = False
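For anyone landing here, a minimal sketch of that workaround applied right after loading the model (the path is the one from the reproducer above; parameter values are illustrative):

import joblib

model = joblib.load("/biz/baz/my_model.pkl")

# Keep prediction single-threaded and, crucially, silence joblib's verbose
# output so nothing is printed from inside the Parallel call.
model.set_params(n_jobs=1, verbose=0)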
Hey team!
I was debugging a deadlock in our release candidate that only happens every ~3 days. I really wanted to provide some code, but I'm not able to reproduce it 😞.
Our setup includes 2 Python processes which each independently use joblib.load() to load a pickled sklearn.ensemble.RandomForestClassifier. Although it was trained with n_jobs=-1, when we actually call sklearn's predict_proba on it we hardcode n_jobs=1.

I read through countless issues, but the top 3 interesting things I learned were these:
1. Others were also seeing a hang issue with sklearn + joblib, and the solution there was to use the threading backend. Since I don't control the joblib part of the code, I dug in and found that sklearn uses joblib's Parallel to execute my n_jobs=1 call (GitHub - scikit-learn#_forest.py#L953-L956), which results in the SequentialBackend being selected (joblib/joblib/_parallel_backends.py, lines 423 to 426 in f70939c). joblib's Parallel will in turn make calls to a BatchCompletionCallBack (joblib/joblib/parallel.py, lines 1406 to 1418 in f70939c), which eventually executes my job (joblib/joblib/parallel.py, lines 597 to 599 in f70939c).

What I was surprised to see was that at this point, self._backend has switched to the LokyBackend. I found that it's because of these 2 pieces of code: joblib/joblib/parallel.py, lines 1505 to 1508 in f70939c, and joblib/joblib/_parallel_backends.py, lines 237 to 239 in f70939c. This backend concerns me because I can see it is called with self._backend_args['context'] set to multiprocessing.context.ForkContext, and I know a fork can hang (numpy/numpy#9248 (comment)); see joblib/joblib/parallel.py, lines 1261 to 1262 in f70939c. But as far as I can tell these args are completely ignored. Either way, even if the backend is a LokyBackend at this point, the sklearn code defines a function that does not even call joblib's apply_async, so I wouldn't expect it to hang (ref: GitHub - scikit-learn#_forest.py#L720-L733). So I can't apply the fix in the sklearn issue, but I don't see how it could help me anyway.
2. Another issue talks about a hang that was fixed by changing the dask backend to stop using a client.scatter call. This reminds me to mention that we have fixed the hang by removing all async and await keywords from the sklearn predict_proba call path. That doesn't mean we've removed it from the asyncio loop; it's still there. However, for some reason, calling sklearn from an async context is causing a hang.

So this makes me ask: is it possible that there is another hold & wait somewhere in either the SequentialBackend or the LokyBackend I mentioned? I know it doesn't sound possible, but it's my current hypothesis... In our logs for the hang, I see Process A start the sklearn call, which always finishes in < 1 second, and next I see Process B also start a call to the model. Process B continues without a problem, but Process A gets stuck for 30 minutes, until Process B runs some totally unrelated numpy code and Process A wakes up and finishes with no problem.

3. Here I learned that joblib can share memory (via memory mapping) if the numpy arrays it handles are big enough. Could that cause a hold & wait deadlock scenario?
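On that memory-sharing point: joblib only memory-maps arrays when explicitly asked to at load time (or when its process-based Parallel machinery auto-memmaps large worker arguments). A minimal sketch of the load-time switch, reusing the model path from the reproducer above:

import joblib

# Default: arrays are fully materialized in this process's private memory.
model = joblib.load("/biz/baz/my_model.pkl")

# Opt-in memory mapping: large numpy arrays inside the file are mapped
# read-only from disk and can be shared between processes at the OS level.
model_mm = joblib.load("/biz/baz/my_model.pkl", mmap_mode="r")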
Dependencies:

scikit-learn == 0.22
joblib == 1.2
platform=linux AL2 instance
Very sorry for the verbosity, but thank you very much for your help and I really appreciate any insight or ideas into this! 😃
cc @ogrisel if you have time! But please no worries if you don't 😄