[MRG] Support multi-threading of LibLinear L1 one-vs-rest LogisticRegression for # classes > 2 #6448
…63). This is needed for very large training sets. Feature indices (based on the number of distinct features) are unlikely to need 4 bytes per value, however.
Tweak comments
… multi-class liblinear classification.
…ti-class one-vs-rest invocations
Are all the changes to liblinear coming from upstream?
No, they're local modifications. There are liblinear forks on GitHub, but I didn't find a convenient place to make changes, and I wasn't 100% sure which one is the authoritative source.
Liblinear was recently patched to handle sample weights (#5274), but that patch was smaller than this one.
To better understand what this patch does, it:
Since the LogisticRegression class already has an n_jobs parameter, I introduced a new parameter on the class, n_threads, to give the API control over whether threads are used and how many. This parallelization may or may not be desirable in conjunction with the grid search parallelization provided by n_jobs. If scikit-learn classes other than LogisticRegression use liblinear, the n_threads parameter presumably defaults to 1 for them, because everything compiles fine. The patch has been tested many times on OS X and Ubuntu 15.10, on datasets that used to take over 7 days to train and now take about 11 hours using 48 cores. I added a compile switch to turn off the pthreads dependency completely, in case there are compilation issues under some circumstances. There is also parallelization available for single runs of train_one, e.g. if one only has 2 target classes, but these approaches are still somewhat experimental. The two best contenders in that space that I've found are Shotgun and Bundle CDN. But in my case, I'm happy with just parallelizing across different classes.
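To make the idea of parallelizing across classes concrete, here is a minimal pure-Python sketch. It is not the code in this patch: the patch does this inside liblinear's C code with pthreads, whereas the `train_one` below is a hypothetical stand-in that fits a binary logistic model by plain gradient descent, and `train_one_vs_rest` and `predict` are illustrative names. The `n_threads` argument only mirrors the proposed parameter. Because of the GIL, this sketch should not be expected to show a speedup; it only illustrates that each one-vs-rest subproblem is independent and can be handed to its own worker.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def train_one(X, y_binary, epochs=3000, lr=1.0):
    """Hypothetical stand-in for liblinear's train_one: fit a single
    binary logistic model (weights, bias) by batch gradient descent."""
    n_samples, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y_binary):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted prob. minus label
            grad_b += err
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
        # Step along the negative mean gradient of the logistic loss.
        w = [wj - lr * g / n_samples for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n_samples
    return w, b


def train_one_vs_rest(X, y, n_threads=1):
    """Train one binary model per class, one task per class.
    Each task only reads X and y, so no locking is needed."""
    classes = sorted(set(y))

    def fit(c):
        return train_one(X, [1.0 if yi == c else 0.0 for yi in y])

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        models = list(pool.map(fit, classes))  # map() preserves class order
    return dict(zip(classes, models))


def predict(models, x):
    """Return the class whose binary model gives the highest score."""
    def score(c):
        w, b = models[c]
        return sum(wj * xj for wj, xj in zip(w, x)) + b
    return max(models, key=score)


# Tiny made-up dataset: three well-separated clusters in 2D.
X = [[0.0, 0.0], [0.1, 0.1], [0.2, 0.0],
     [1.0, 0.0], [0.9, 0.1], [1.0, 0.2],
     [0.0, 1.0], [0.1, 0.9], [0.2, 1.0]]
y = [0, 0, 0, 1, 1, 1, 2, 2, 2]

models = train_one_vs_rest(X, y, n_threads=3)
print([predict(models, x) for x in X])
```

Since the per-class fits share no mutable state, the result does not depend on `n_threads`; only the wall-clock time would, in an implementation where the threads can actually run concurrently.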
Since as our docstring now states,
This branch subsumes the changes made to support larger sparse feature arrays. I don't know of a clean, simple way to separate the two without creating a new fork.
I suspect that if this change is of interest, the maintainers may want to change how the functionality is interfaced in the API, which may require a new branch anyway.