
[MRG] Support multi-threading of LibLinear L1 one-vs-rest LogisticRegression for # classes > 2 #6448


Closed
mannby wants to merge 26 commits



@mannby mannby commented Feb 25, 2016

This branch subsumes the changes made to support larger feature sparse arrays. I don't know of a clean, simple way of separating the two without creating a new fork.
I suspect that if this change is of interest, the maintainers may want to change how the functionality is interfaced in the API, which may require a new branch anyway.

@agramfort
Member

are all the changes to liblinear coming from upstream?

@mannby
Author

mannby commented Feb 25, 2016

No, they're local modifications. There are liblinear forks on GitHub, but I didn't find a convenient place to make changes, and I wasn't 100% sure which source is authoritative.

@agramfort
Member

agramfort commented Feb 26, 2016 via email

@TomDLT
Member

TomDLT commented Feb 26, 2016

Liblinear was recently patched to handle sample weights (#5274), but that patch was smaller than this one.

@mannby
Author

mannby commented Feb 26, 2016

To clarify what this patch does, it:

  1. Adds wrapper functions, to start/join threads using pthreads, that support Windows, OS X and other Unixes, as well as wrappers for mutexes and semaphores.
  2. Targets only two liblinear solvers and cases where there are more than two classes.
  3. Saves memory by performing the transposition that these two solvers need only once, instead of once per invocation of train_one.
  4. Starts up n threads, and has them chew away at the independent one-vs-rest training for each of the classes until they're all done.
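The scheduling in steps 3 and 4 can be sketched in a few lines of Python. This is only an illustration of the pattern, not the patch's C code: `train_one` here is a hypothetical stand-in for a binary solver (it just scores each feature by a difference of means), and `train_one_vs_rest` and its `n_threads` argument are names made up for this example. Because of CPython's global interpreter lock, this pure-Python version shows how the work is distributed but would not show the speedup that native threads give.

```python
import queue
import threading


def train_one(columns, labels, positive_class):
    """Hypothetical stand-in for a binary solver (not liblinear's train_one).

    For each feature column, return the mean value over rows of the
    positive class minus the mean value over all other rows.
    """
    weights = []
    for column in columns:
        pos = [v for v, y in zip(column, labels) if y == positive_class]
        neg = [v for v, y in zip(column, labels) if y != positive_class]
        pos_mean = sum(pos) / len(pos) if pos else 0.0
        neg_mean = sum(neg) / len(neg) if neg else 0.0
        weights.append(pos_mean - neg_mean)
    return weights


def train_one_vs_rest(rows, labels, n_threads=2):
    """Train one binary model per class with a fixed pool of worker threads."""
    # Transpose once and share the result read-only across all workers,
    # rather than transposing inside every train_one call.
    columns = list(zip(*rows))

    # Each class is an independent task; workers pull until the queue is empty.
    tasks = queue.Queue()
    for cls in sorted(set(labels)):
        tasks.put(cls)

    models = {}
    models_lock = threading.Lock()

    def worker():
        while True:
            try:
                cls = tasks.get_nowait()
            except queue.Empty:
                return  # no classes left to train
            weights = train_one(columns, labels, cls)
            with models_lock:
                models[cls] = weights

    threads = [threading.Thread(target=worker) for _ in range(max(1, n_threads))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return models


if __name__ == "__main__":
    rows = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
    labels = ["a", "b", "c", "a"]
    print(train_one_vs_rest(rows, labels, n_threads=3))
```

Pulling classes from a shared queue, rather than assigning a fixed slice of classes to each thread, keeps all workers busy when some classes take much longer to train than others.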

The LogisticRegression class already has an n_jobs parameter. To give the API control over whether threads are used, and how many, I introduced a new parameter on the LR class called n_threads, since this parallelization may or may not be desirable in conjunction with the grid search parallelization provided by n_jobs. If scikit-learn classes other than LogisticRegression use liblinear, n_threads presumably defaults to 1 for them, since everything compiles fine.

It's been tested many times on OS X and Ubuntu 15.10, on datasets that used to take over 7 days to train and now take about 11 hours using 48 cores.
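For a rough sense of scale, the figures above can be turned into a speedup and a per-core efficiency. The sketch below assumes exactly 7 days (168 hours) as the old training time, which the comment gives only as a lower bound, and exactly 11 hours as the new one; `speedup_and_efficiency` is a helper name made up for this example. Part of the gain may also come from doing the transposition once, so the numbers do not isolate the effect of threading.

```python
def speedup_and_efficiency(serial_hours, parallel_hours, n_cores):
    """Return (speedup, speedup per core) for a parallelized run."""
    speedup = serial_hours / parallel_hours
    return speedup, speedup / n_cores


# Assumed inputs: 7 days before the patch, 11 hours after, 48 cores.
speedup, efficiency = speedup_and_efficiency(7 * 24, 11, 48)
print(f"speedup: about {speedup:.1f}x, efficiency: about {efficiency:.2f}")
```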

I added a compile switch to turn off the pthreads dependency completely, in case there are compilation issues under some circumstances.

Parallelization is also available for single runs of train_one, e.g. if one only has 2 target classes, but those approaches are still somewhat experimental. The two best contenders in that space that I've found are Shotgun and Bundle CDN. In my case, I'm happy with just parallelizing across different classes.

@mannby mannby changed the title from "Support multi-threading of LibLinear L1 one-vs-rest LogisticRegression for # classes > 2" to "[MRG] Support multi-threading of LibLinear L1 one-vs-rest LogisticRegression for # classes > 2" on Sep 1, 2016
Base automatically changed from master to main January 22, 2021 10:49
@adrinjalali
Member

As our docstring now states, liblinear is only good for small datasets, and other solvers already handle parallelism, so I don't think we need to add this.


7 participants